-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Use bfdot instruction in ARM bfloat16 dot product if compiling with it available #127488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127488
Note: Links to docs will display an error until the docs builds have been completed. ❌ 2 New Failures, 6 Unrelated FailuresAs of commit bf76663 with merge base 36e9b71 ( NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
BROKEN TRUNK - The following job failed but were present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
| #ifdef __ARM_FEATURE_BF16 | ||
| return vbfdotq_f32(a, b, c); | ||
| #else | ||
| TORCH_CHECK(false, "unsupported operation used!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a static_assert False situation? I guess we want the function as a stub even if it's unavailable which precludes that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a static_assert False situation?
It should be statically dead in this situation; the TORCH_CHECK is just a backstop in case that isn't true in the future by accident.
| } | ||
|
|
||
| static C10_ALWAYS_INLINE float32x4_t f32_dot_bf16(float32x4_t a, bfloat16x8_t b, bfloat16x8_t c) { | ||
| #ifdef __ARM_FEATURE_BF16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To trigger it to true one needs to compile with --march=armv8+bf16
malfet
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't see much perf changes:
Before
mv_nt torch.float32 4.99 usec
mv_nt torch.float16 9.70 usec
mv_nt torch.bfloat16 15.38 usec
mv_ta torch.float32 3.96 usec
mv_ta torch.float16 26.92 usec
mv_ta torch.bfloat16 156.50 usec
notrans torch.float32 3.96 usec
notrans torch.float16 28.00 usec
notrans torch.bfloat16 75.61 usec
trans_a torch.float32 3.94 usec
trans_a torch.float16 83.89 usec
trans_a torch.bfloat16 366.66 usec
trans_b torch.float32 7.44 usec
trans_b torch.float16 9.58 usec
trans_b torch.bfloat16 17.40 usec
after
mv_nt torch.float32 4.93 usec
mv_nt torch.float16 9.69 usec
mv_nt torch.bfloat16 15.18 usec
mv_ta torch.float32 3.84 usec
mv_ta torch.float16 26.41 usec
mv_ta torch.bfloat16 153.94 usec
notrans torch.float32 3.83 usec
notrans torch.float16 27.57 usec
notrans torch.bfloat16 74.61 usec
trans_a torch.float32 3.81 usec
trans_a torch.float16 83.84 usec
trans_a torch.bfloat16 366.30 usec
trans_b torch.float32 7.35 usec
trans_b torch.float16 9.50 usec
trans_b torch.bfloat16 17.38 usec
very surprising. I would recommend 2 things:
|
|
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as |
|
getting to the bottom of this now by benchmarking on phone |
Port of pytorch/pytorch#127488 . Differential Revision: [D62159105](https://our.internmc.facebook.com/intern/diff/D62159105/) [ghstack-poisoned]
…M_FEATURE_BF16" Port of pytorch/pytorch#127488 . Differential Revision: [D62159105](https://our.internmc.facebook.com/intern/diff/D62159105/) [ghstack-poisoned]
Port of pytorch/pytorch#127488 . Differential Revision: [D62159105](https://our.internmc.facebook.com/intern/diff/D62159105/) [ghstack-poisoned]
…M_FEATURE_BF16" Port of pytorch/pytorch#127488 . Differential Revision: [D62159105](https://our.internmc.facebook.com/intern/diff/D62159105/) [ghstack-poisoned]
Port of pytorch/pytorch#127488 . Differential Revision: [D62159105](https://our.internmc.facebook.com/intern/diff/D62159105/) [ghstack-poisoned]
…M_FEATURE_BF16" Port of pytorch/pytorch#127488 . Differential Revision: [D62159105](https://our.internmc.facebook.com/intern/diff/D62159105/) [ghstack-poisoned]
Port of pytorch/pytorch#127488 . Differential Revision: [D62159105](https://our.internmc.facebook.com/intern/diff/D62159105/) [ghstack-poisoned]
Summary: Pull Request resolved: #5249 Port of pytorch/pytorch#127488 . ghstack-source-id: 242493751 exported-using-ghexport Reviewed By: kimishpatel Differential Revision: D62159105 fbshipit-source-id: e8256ef21373ee5bbc58c15fa2dcf2753af839e0
|
this will be replaced by a PR to sync the latest work from executorch (including pytorch/executorch#5444 ), and hopefully follow-up to extract a file that can be identical between pytorch/pytorch and pytorch/executorch. |
ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) [ghstack-poisoned]
ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) ghstack-source-id: 243596406 Pull Request resolved: #136331
|
replacement PR is #136331 |
…t back to ATen BlasKernel" ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) [ghstack-poisoned]
…lasKernel" ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) [ghstack-poisoned]
Pull Request resolved: #136331 ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) ghstack-source-id: 244185360
…t back to ATen BlasKernel" ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) [ghstack-poisoned]
…lasKernel" ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) [ghstack-poisoned]
Pull Request resolved: #136331 ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 . ghstack-source-id: 244187352 Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
Pull Request resolved: #136331 ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 . ghstack-source-id: 245455630 Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
…t back to ATen BlasKernel" ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) [ghstack-poisoned]
…lasKernel" ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) [ghstack-poisoned]
Pull Request resolved: #136331 ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 . ghstack-source-id: 245903000 Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
…t back to ATen BlasKernel" ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) [ghstack-poisoned]
…lasKernel" ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) [ghstack-poisoned]
…136331) ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) Pull Request resolved: #136331 Approved by: https://github.com/malfet, https://github.com/albanD ghstack dependencies: #136445
Pull Request resolved: pytorch/executorch#5249 Port of pytorch/pytorch#127488 . ghstack-source-id: 242493751 @exported-using-ghexport Differential Revision: [D62159105](https://our.internmc.facebook.com/intern/diff/D62159105/)
Stack from ghstack (oldest at bottom):
Summary: We should use this intrinsic if the hardware supports it.
Test Plan: Currently untested because I don't have access to appropriate hardware.