Skip to content

Conversation

@swolchok
Copy link
Contributor

@swolchok swolchok commented May 30, 2024

Stack from ghstack (oldest at bottom):

Summary: We should use this intrinsic if the hardware supports it.

Test Plan: Currently untested because I don't have access to appropriate hardware.

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented May 30, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127488

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 6 Unrelated Failures

As of commit bf76663 with merge base 36e9b71 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

swolchok added a commit that referenced this pull request May 30, 2024
…t available

Summary: We should use this intrinsic if the hardware supports it.

Test Plan: Currently untested because I don't have access to appropriate hardware.

ghstack-source-id: e8b7538
Pull Request resolved: #127488
@swolchok swolchok requested a review from malfet May 30, 2024 00:12
@swolchok swolchok added the topic: performance topic category label May 30, 2024
#ifdef __ARM_FEATURE_BF16
return vbfdotq_f32(a, b, c);
#else
TORCH_CHECK(false, "unsupported operation used!");
Copy link
Collaborator

@Skylion007 Skylion007 Jun 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a static_assert False situation? I guess we want the function as a stub even if it's unavailable which precludes that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a static_assert False situation?

It should be statically dead in this situation; the TORCH_CHECK is just a backstop in case that isn't true in the future by accident.

[ghstack-poisoned]
swolchok added a commit that referenced this pull request Jun 3, 2024
…t available

Summary: We should use this intrinsic if the hardware supports it.

Test Plan: Currently untested because I don't have access to appropriate hardware.

ghstack-source-id: 2444407
Pull Request resolved: #127488
@malfet malfet added ciflow/trunk Trigger trunk jobs on your pull request ciflow/linux-aarch64 linux aarch64 CI workflow ciflow/mps Run MPS tests (subset of trunk) labels Jun 3, 2024
[ghstack-poisoned]
}

static C10_ALWAYS_INLINE float32x4_t f32_dot_bf16(float32x4_t a, bfloat16x8_t b, bfloat16x8_t c) {
#ifdef __ARM_FEATURE_BF16
Copy link
Contributor

@malfet malfet Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To trigger it to true one needs to compile with --march=armv8+bf16

Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't see much perf changes:
Before

mv_nt    torch.float32    4.99 usec
mv_nt    torch.float16    9.70 usec
mv_nt   torch.bfloat16   15.38 usec
mv_ta    torch.float32    3.96 usec
mv_ta    torch.float16   26.92 usec
mv_ta   torch.bfloat16  156.50 usec
notrans  torch.float32    3.96 usec
notrans  torch.float16   28.00 usec
notrans torch.bfloat16   75.61 usec
trans_a  torch.float32    3.94 usec
trans_a  torch.float16   83.89 usec
trans_a torch.bfloat16  366.66 usec
trans_b  torch.float32    7.44 usec
trans_b  torch.float16    9.58 usec
trans_b torch.bfloat16   17.40 usec

after

mv_nt    torch.float32    4.93 usec
mv_nt    torch.float16    9.69 usec
mv_nt   torch.bfloat16   15.18 usec
mv_ta    torch.float32    3.84 usec
mv_ta    torch.float16   26.41 usec
mv_ta   torch.bfloat16  153.94 usec
notrans  torch.float32    3.83 usec
notrans  torch.float16   27.57 usec
notrans torch.bfloat16   74.61 usec
trans_a  torch.float32    3.81 usec
trans_a  torch.float16   83.84 usec
trans_a torch.bfloat16  366.30 usec
trans_b  torch.float32    7.35 usec
trans_b  torch.float16    9.50 usec
trans_b torch.bfloat16   17.38 usec

@swolchok
Copy link
Contributor Author

swolchok commented Jun 5, 2024

Hmm, I don't see much perf changes:

very surprising. I would recommend 2 things:

  1. profile with instruments and make sure bfdot is actually getting emitted -- xctrace record --template 'Time Profiler' --launch -- /Users/swolchok/homebrew/Caskroom/miniconda/base/envs/pytorch/bin/python3 benchmarks/benchmark_torch_mm.py llm (replace with path to your python3)
  2. try the larger benchmark cases I sent a PR for: Add LLM benchmark case to benchmark_torch_mm.py malfet/llm_experiments#8

@github-actions
Copy link
Contributor

github-actions bot commented Aug 4, 2024

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Aug 4, 2024
@swolchok
Copy link
Contributor Author

swolchok commented Aug 26, 2024

@malfet it looks like your benchmark results above may have been missing 6f18175 (#127318) . Note that your trans_a results match the "Before" and not the "After" on that commit.

@swolchok
Copy link
Contributor Author

getting to the bottom of this now by benchmarking on phone

facebook-github-bot pushed a commit to pytorch/executorch that referenced this pull request Sep 14, 2024
Summary:
Pull Request resolved: #5249

Port of pytorch/pytorch#127488 .
ghstack-source-id: 242493751
exported-using-ghexport

Reviewed By: kimishpatel

Differential Revision: D62159105

fbshipit-source-id: e8256ef21373ee5bbc58c15fa2dcf2753af839e0
@swolchok
Copy link
Contributor Author

this will be replaced by a PR to sync the latest work from executorch (including pytorch/executorch#5444 ), and hopefully follow-up to extract a file that can be identical between pytorch/pytorch and pytorch/executorch.

@swolchok swolchok closed this Sep 19, 2024
swolchok added a commit that referenced this pull request Sep 19, 2024
ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)

[ghstack-poisoned]
swolchok added a commit that referenced this pull request Sep 19, 2024
ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)

ghstack-source-id: 243596406
Pull Request resolved: #136331
@swolchok
Copy link
Contributor Author

replacement PR is #136331

swolchok added a commit that referenced this pull request Sep 23, 2024
…t back to ATen BlasKernel"

ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)

[ghstack-poisoned]
swolchok added a commit that referenced this pull request Sep 23, 2024
…lasKernel"

ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)

[ghstack-poisoned]
swolchok added a commit that referenced this pull request Sep 23, 2024
Pull Request resolved: #136331

ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
ghstack-source-id: 244185360
swolchok added a commit that referenced this pull request Sep 23, 2024
…t back to ATen BlasKernel"

ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)

[ghstack-poisoned]
swolchok added a commit that referenced this pull request Sep 23, 2024
…lasKernel"

ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)

[ghstack-poisoned]
swolchok added a commit that referenced this pull request Sep 23, 2024
Pull Request resolved: #136331

ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 .
ghstack-source-id: 244187352

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
swolchok added a commit that referenced this pull request Sep 30, 2024
Pull Request resolved: #136331

ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 .
ghstack-source-id: 245455630

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
swolchok added a commit that referenced this pull request Sep 30, 2024
…t back to ATen BlasKernel"


ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)

[ghstack-poisoned]
swolchok added a commit that referenced this pull request Sep 30, 2024
…lasKernel"


ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)

[ghstack-poisoned]
swolchok added a commit that referenced this pull request Oct 2, 2024
Pull Request resolved: #136331

ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 .
ghstack-source-id: 245903000

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
swolchok added a commit that referenced this pull request Oct 2, 2024
…t back to ATen BlasKernel"


ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)

[ghstack-poisoned]
swolchok added a commit that referenced this pull request Oct 2, 2024
…lasKernel"


ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)

[ghstack-poisoned]
@github-actions github-actions bot deleted the gh/swolchok/638/head branch October 20, 2024 02:09
kedarnath03 pushed a commit to kedarnath03/executorch that referenced this pull request Jun 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/linux-aarch64 linux aarch64 CI workflow ciflow/mps Run MPS tests (subset of trunk) ciflow/trunk Trigger trunk jobs on your pull request Stale topic: performance topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants