Skip to content

Conversation

@aditew01
Copy link
Collaborator

@aditew01 aditew01 commented Aug 13, 2024

Scope: Enable PyTorch build with SLEEF on Arm by default. Enable codegen kernels compilation with SLEEF on ARM platform.

Enabling the build with SLEEF by default and setting AT_BUILD_ARM_VEC256_WITH_SLEEF as the default for Arm improves performance for some models. I have benchmarked several networks on Neoverse-V1 using torch.compile with the inductor backend.
On models like hf_Bert_Large , hf_GPT_fast, we're seeing a ~1.2x speedup (with 16 threads).

The below results are run with Batch_Size=1 and Cores=8, 16

Screenshot 2024-08-27 at 17 04 23

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @gujinghui @PenghuiCheng @jianyuh @min-jean-cho @yanbing-j @Guobing-Chen @Xia-Weiwen @snadampal @mcarilli @ptrblck @leslie-fang-intel @malfet @milpuz01 @EikanWang @voznesenskym @penguinwu @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn

@pytorch-bot
Copy link

pytorch-bot bot commented Aug 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133339

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (5 Unrelated Failures)

As of commit 4fe7a60 with merge base 701ba52 (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Aug 13, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@janeyx99 janeyx99 requested a review from EikanWang August 14, 2024 00:35
@janeyx99 janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Aug 14, 2024
@aditew01
Copy link
Collaborator Author

@pytorchbot drci

@aditew01
Copy link
Collaborator Author

@pytorchbot label ciflow/linux-aarch64 module: arm

@pytorch-bot
Copy link

pytorch-bot bot commented Aug 14, 2024

Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.

@aditew01
Copy link
Collaborator Author

cc: @malfet

@cfRod
Copy link
Collaborator

cfRod commented Aug 15, 2024

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label Aug 15, 2024
@cfRod
Copy link
Collaborator

cfRod commented Aug 15, 2024

@pytorchbot label "module:arm"

@pytorch-bot
Copy link

pytorch-bot bot commented Aug 15, 2024

Didn't find following labels among repository labels: module:arm

@pytorch-bot pytorch-bot bot added module: cpu CPU specific problem (e.g., perf, algorithm) release notes: sparse release notes category labels Aug 16, 2024
@aditew01 aditew01 force-pushed the aditew01/arm_sleef branch from 7712030 to 676c737 Compare August 16, 2024 13:22
@pytorch-bot
Copy link

pytorch-bot bot commented Aug 16, 2024

Please seek CI approval before scheduling CIFlow labels

@aditew01
Copy link
Collaborator Author

@pytorchbot rebase

@pytorch-bot
Copy link

pytorch-bot bot commented Aug 21, 2024

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

@aditew01
Copy link
Collaborator Author

aditew01 commented Aug 21, 2024

@pytorchbot label "module: arm"

@pytorch-bot
Copy link

pytorch-bot bot commented Aug 21, 2024

Didn't find following labels among repository labels: module:arm

@aditew01
Copy link
Collaborator Author

@pytorchbot label "module: arm"

@pytorch-bot pytorch-bot bot added the module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 label Aug 21, 2024
@maajidkhann
Copy link
Contributor

We have also tested "SVE + Inductor flow" from @aditew01 PR: (#134672) without sleef and with sleef and observed consistent performance improvements with sleef build.

This change will further enhance default performance on ARM CPU's.

The below results are from torchbench on a 32 core Graviton 3 EC2 Instance:

image

changes LGTM!

@abhishek-iitmadras
Copy link
Collaborator

@pytorchbot merge -f 'All related PR tests are green'

@pytorch-bot
Copy link

pytorch-bot bot commented Sep 20, 2024

You are not authorized to force merges to this repository. Please use the regular @pytorchmergebot merge command instead

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command
For more information see pytorch-bot wiki.

@aditew01
Copy link
Collaborator Author

@malfet a naive question, do we re-trigger the mergebot?

@malfet
Copy link
Contributor

malfet commented Sep 20, 2024

@pytorchbot merge

@malfet
Copy link
Contributor

malfet commented Sep 20, 2024

@malfet a naive question, do we re-trigger the mergebot?

Yes, you should be able to, I was waiting for Android fix before issuing another merge command.

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@malfet
Copy link
Contributor

malfet commented Sep 20, 2024

@pytorchbot merge -f "No need to wait for torchbench runs"

@pytorchmergebot
Copy link
Collaborator

The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command
For more information see pytorch-bot wiki.

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

zou3519 added a commit that referenced this pull request Sep 25, 2024
Revert "[PT2][Inductor][Optmus] fix test_pad_mm_bf16 and reland to fix long computation kernel (#136349)"

This reverts commit e184391.

Revert "Fix clang-tidy warnings in torch/csrc/lazy (#134655)"

This reverts commit 0287146.

Revert "Remove duplicate line (#136383)"

This reverts commit 0b91e7e.

Revert "[TF32] Account for TF32 in `test_conv_double_backward` (#135716)"

This reverts commit 29f7b8d.

Revert "Fix `Vectorized<double>::next_after` SVE compilation (#136388)"

This reverts commit 7936584.

Revert "Upgrade pybind11 API calls for 3.13t (#136370)"

This reverts commit 067d203.

Revert "[AOTI][Tooling] Filter out kernels based off lowercase names (#135395)"

This reverts commit 1a10751.

Revert "Add decomps for max_unpool (#133146)"

This reverts commit 0c936c3.

Revert "add TORCH_CUDA_CPP_API for AutoNcclGroup (#130012)"

This reverts commit 293fccf.

Revert "Use cpython declaration of _PyWeakref_ClearRef (#136300)"

This reverts commit d2455b9.

Revert "fix mypi in utils/_sympy/functions.py (#136339)"

This reverts commit 7f9c064.

Revert "[Inductor] Fix test_profiler_mark_wrapper_call_cuda_cuda_wrapper (#136356)"

This reverts commit f53a0f9.

Revert "Add more distributed examples (#130427)"

This reverts commit 5997354.

Revert "return instead of using skipTest (#136244)"

This reverts commit 29affa6.

Reapply "[PT2/Profiler] Add Context Info to Torch-Compiled Regions (#132765)"

This reverts commit 783c5ba.

Revert "Enable torch build with SLEEF on ARM by default (#133339)"

This reverts commit 4842f0f.

Revert "[inductor] Relax the conditions for loop split (#135335)"

This reverts commit 687e5cf.

[ghstack-poisoned]
zou3519 added a commit that referenced this pull request Sep 25, 2024
Revert "[PT2][Inductor][Optmus] fix test_pad_mm_bf16 and reland to fix long computation kernel (#136349)"

This reverts commit e184391.

Revert "Fix clang-tidy warnings in torch/csrc/lazy (#134655)"

This reverts commit 0287146.

Revert "Remove duplicate line (#136383)"

This reverts commit 0b91e7e.

Revert "[TF32] Account for TF32 in `test_conv_double_backward` (#135716)"

This reverts commit 29f7b8d.

Revert "Fix `Vectorized<double>::next_after` SVE compilation (#136388)"

This reverts commit 7936584.

Revert "Upgrade pybind11 API calls for 3.13t (#136370)"

This reverts commit 067d203.

Revert "[AOTI][Tooling] Filter out kernels based off lowercase names (#135395)"

This reverts commit 1a10751.

Revert "Add decomps for max_unpool (#133146)"

This reverts commit 0c936c3.

Revert "add TORCH_CUDA_CPP_API for AutoNcclGroup (#130012)"

This reverts commit 293fccf.

Revert "Use cpython declaration of _PyWeakref_ClearRef (#136300)"

This reverts commit d2455b9.

Revert "fix mypi in utils/_sympy/functions.py (#136339)"

This reverts commit 7f9c064.

Revert "[Inductor] Fix test_profiler_mark_wrapper_call_cuda_cuda_wrapper (#136356)"

This reverts commit f53a0f9.

Revert "Add more distributed examples (#130427)"

This reverts commit 5997354.

Revert "return instead of using skipTest (#136244)"

This reverts commit 29affa6.

Reapply "[PT2/Profiler] Add Context Info to Torch-Compiled Regions (#132765)"

This reverts commit 783c5ba.

Revert "Enable torch build with SLEEF on ARM by default (#133339)"

This reverts commit 4842f0f.

Revert "[inductor] Relax the conditions for loop split (#135335)"

This reverts commit 687e5cf.

ghstack-source-id: b0fb91e
Pull Request resolved: #136668
@blapie
Copy link

blapie commented Sep 27, 2024

Hello! SLEEF maintainer speaking here. I have a few questions regarding this PR.

  1. Why does it have to be disabled on Android? Is there a plan to enable it?
  2. What is compared in the benchmarks? Is it comparing against calls to standard scalar implementations (e.g. libc)? When SLEEF is enabled, is it calling to Neon or SVE routines?
  3. I suppose only the high accuracy routines are used, since they have the typical GNU ABI names?
  4. Is there room for relaxing accuracy and use the more competitive 3.5ULP (standard accuracy in glibc libmvec)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/inductor ciflow/linux-aarch64 linux aarch64 CI workflow ciflow/trunk Trigger trunk jobs on your pull request Merged module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 module: cpu CPU specific problem (e.g., perf, algorithm) module: dynamo module: inductor module: mkldnn Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration open source release notes: build release notes category topic: improvements topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.