Enable torch build with SLEEF on ARM by default #133339

aditew01 · 2024-08-13T18:22:13Z

Scope: Enable the PyTorch build with SLEEF on Arm by default, and enable compilation of codegen kernels with SLEEF on Arm platforms.

Enabling the build with SLEEF by default and setting AT_BUILD_ARM_VEC256_WITH_SLEEF as the default for Arm improves performance for some models. I have benchmarked several networks on Neoverse-V1 using torch.compile with the inductor backend.
On models like hf_Bert_Large and hf_GPT_fast, we're seeing a ~1.2x speedup (with 16 threads).

The results below were run with Batch_Size=1 and Cores=8, 16.
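
A comparison of this kind can be approximated with a small timing harness. The following is only a minimal sketch, not the script used for the numbers in this PR: `median_latency` and `speedup` are hypothetical helper names, and the PyTorch usage in the trailing comment assumes a `model` and `example_input` that are not defined here.

```python
import statistics
import time
from typing import Callable


def median_latency(fn: Callable[[], object], warmup: int = 3, iters: int = 10) -> float:
    """Return the median wall-clock time of fn() in seconds, after warm-up runs."""
    for _ in range(warmup):
        fn()  # warm-up calls absorb one-time costs such as compilation
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Ratio baseline/candidate; values above 1.0 mean the candidate is faster."""
    if candidate_s <= 0:
        raise ValueError("candidate latency must be positive")
    return baseline_s / candidate_s


# Hypothetical usage with PyTorch (assumed setup, not executed here):
#   import torch
#   torch.set_num_threads(16)
#   compiled = torch.compile(model, backend="inductor")
#   with torch.no_grad():
#       latency = median_latency(lambda: compiled(example_input))
```

Running the same harness against a build without SLEEF and a build with SLEEF, then passing both medians to `speedup`, gives a ratio comparable in spirit to the ~1.2x figure quoted above.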

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @gujinghui @PenghuiCheng @jianyuh @min-jean-cho @yanbing-j @Guobing-Chen @Xia-Weiwen @snadampal @mcarilli @ptrblck @leslie-fang-intel @malfet @milpuz01 @EikanWang @voznesenskym @penguinwu @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn

pytorch-bot · 2024-08-13T18:22:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133339


✅ You can merge normally! (5 Unrelated Failures)

As of commit 4fe7a60 with merge base 701ba52:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100) (gh) (detected as infra flaky with no runner)
pull / linux-focal-py3.11-clang10 / test (dynamo, 3, 3, lf.linux.2xlarge) (gh) (disabled by #128551 but the issue was closed recently and a rebase is needed to make it pass)
test_dataloader.py::TestDataLoader::test_segfault
pull / linux-focal-py3.12-clang10 / test (dynamo, 2, 3, lf.linux.2xlarge) (gh) (disabled by #128551 but the issue was closed recently and a rebase is needed to make it pass)
test_dataloader.py::TestDataLoader::test_segfault
pull / linux-focal-py3.12-clang10-experimental-split-build / test (dynamo, 2, 3, linux.2xlarge) (gh) (disabled by #128551 but the issue was closed recently and a rebase is needed to make it pass)
test_dataloader.py::TestDataLoader::test_segfault
pull / linux-focal-py3.9-clang10 / test (dynamo, 3, 3, lf.linux.2xlarge) (gh) (disabled by #128551 but the issue was closed recently and a rebase is needed to make it pass)
test_dataloader.py::TestDataLoader::test_segfault

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2024-08-13T18:22:17Z

The committers listed above are authorized under a signed CLA.

✅ login: aditew01 (bce6630, c8c0c8b, 7a5e0a7, 4fe7a60, 1946800)
✅ login: malfet / name: Nikita Shulga (3dce9b9)

aditew01 · 2024-08-14T10:00:13Z

@pytorchbot drci

aditew01 · 2024-08-14T11:36:06Z

@pytorchbot label ciflow/linux-aarch64 module: arm

pytorch-bot · 2024-08-14T11:36:13Z

Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.

aditew01 · 2024-08-14T11:40:55Z

cc: @malfet

cfRod · 2024-08-15T17:20:56Z

@pytorchbot label "ciflow/linux-aarch64"

cfRod · 2024-08-15T17:21:24Z

@pytorchbot label "module:arm"

pytorch-bot · 2024-08-15T17:21:31Z

Didn't find following labels among repository labels: module:arm

pytorch-bot · 2024-08-16T13:22:50Z

Please seek CI approval before scheduling CIFlow labels

aditew01 · 2024-08-21T09:45:27Z

@pytorchbot rebase

pytorch-bot · 2024-08-21T09:45:31Z

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

aditew01 · 2024-08-21T10:55:33Z

@pytorchbot label "module: arm"

pytorch-bot · 2024-08-21T10:55:39Z

Didn't find following labels among repository labels: module:arm

aditew01 · 2024-08-21T10:59:52Z

@pytorchbot label "module: arm"

This reverts commit 7ce6726.

maajidkhann · 2024-09-19T06:44:04Z

We have also tested the "SVE + Inductor flow" from @aditew01's PR (#134672) without SLEEF and with SLEEF, and observed consistent performance improvements with the SLEEF build.

This change will further enhance default performance on Arm CPUs.

The results below are from torchbench on a 32-core Graviton 3 EC2 instance:

Changes LGTM!

abhishek-iitmadras · 2024-09-20T06:46:58Z

@pytorchbot merge -f 'All related PR tests are green'

pytorch-bot · 2024-09-20T06:47:01Z

You are not authorized to force merges to this repository. Please use the regular @pytorchmergebot merge command instead

pytorchmergebot · 2024-09-20T06:49:14Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2024-09-20T12:47:55Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

aditew01 · 2024-09-20T14:59:58Z

@malfet a naive question, do we re-trigger the mergebot?

malfet · 2024-09-20T15:21:49Z

@pytorchbot merge

malfet · 2024-09-20T15:22:49Z

> @malfet a naive question, do we re-trigger the mergebot?

Yes, you should be able to. I was waiting for the Android fix before issuing another merge command.

pytorchmergebot · 2024-09-20T15:23:45Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


malfet · 2024-09-20T16:00:26Z

@pytorchbot merge -f "No need to wait for torchbench runs"

pytorchmergebot · 2024-09-20T16:00:46Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

pytorchmergebot · 2024-09-20T16:02:18Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.


Should have called [`Sleef_nextafterdx_sve`](https://sleef.org/2-references/libm/aarch64#vectorized-double-precision-function-for-obtaining-the-next-representable-fp-value) rather than [`Sleef_nextafterfx_sve`](https://sleef.org/2-references/libm/aarch64#vectorized-single-precision-function-for-obtaining-the-next-representable-fp-value) to get vectorized `nextafter` for double precision rather than single precision values.

This fixes a compilation issue introduced by #119571 and exposed by #133339.

Pull Request resolved: #136388
Approved by: https://github.com/kit1980
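
To illustrate why the precision of `nextafter` matters, here is a minimal scalar sketch in Python. It does not reproduce the SVE compilation failure itself; it only shows that the "next representable value" differs between double and single precision. `nextafter_f32_up` is a hypothetical helper written for this illustration and assumes a positive, finite input below the float32 maximum.

```python
import math
import struct


def nextafter_f64(x: float, toward: float) -> float:
    """Next representable double after x in the direction of `toward`."""
    return math.nextafter(x, toward)  # available since Python 3.9


def nextafter_f32_up(x: float) -> float:
    """Next representable float32 above x (assumes x is positive and finite)."""
    # Reinterpret the float32 bit pattern as an unsigned integer and add one.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


step_f64 = nextafter_f64(1.0, 2.0) - 1.0   # spacing of doubles just above 1.0
step_f32 = nextafter_f32_up(1.0) - 1.0     # spacing of float32 values just above 1.0
```

Here `step_f64` is 2**-52 while `step_f32` is 2**-23, so a single-precision routine cannot stand in for the double-precision one even where the types would allow it.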


Revert "[PT2][Inductor][Optmus] fix test_pad_mm_bf16 and reland to fix long computation kernel (#136349)" This reverts commit e184391.
Revert "Fix clang-tidy warnings in torch/csrc/lazy (#134655)" This reverts commit 0287146.
Revert "Remove duplicate line (#136383)" This reverts commit 0b91e7e.
Revert "[TF32] Account for TF32 in `test_conv_double_backward` (#135716)" This reverts commit 29f7b8d.
Revert "Fix `Vectorized<double>::next_after` SVE compilation (#136388)" This reverts commit 7936584.
Revert "Upgrade pybind11 API calls for 3.13t (#136370)" This reverts commit 067d203.
Revert "[AOTI][Tooling] Filter out kernels based off lowercase names (#135395)" This reverts commit 1a10751.
Revert "Add decomps for max_unpool (#133146)" This reverts commit 0c936c3.
Revert "add TORCH_CUDA_CPP_API for AutoNcclGroup (#130012)" This reverts commit 293fccf.
Revert "Use cpython declaration of _PyWeakref_ClearRef (#136300)" This reverts commit d2455b9.
Revert "fix mypi in utils/_sympy/functions.py (#136339)" This reverts commit 7f9c064.
Revert "[Inductor] Fix test_profiler_mark_wrapper_call_cuda_cuda_wrapper (#136356)" This reverts commit f53a0f9.
Revert "Add more distributed examples (#130427)" This reverts commit 5997354.
Revert "return instead of using skipTest (#136244)" This reverts commit 29affa6.
Reapply "[PT2/Profiler] Add Context Info to Torch-Compiled Regions (#132765)" This reverts commit 783c5ba.
Revert "Enable torch build with SLEEF on ARM by default (#133339)" This reverts commit 4842f0f.
Revert "[inductor] Relax the conditions for loop split (#135335)" This reverts commit 687e5cf.

ghstack-source-id: b0fb91e
Pull Request resolved: #136668

blapie · 2024-09-27T07:47:06Z

Hello! SLEEF maintainer speaking here. I have a few questions regarding this PR.

Why does it have to be disabled on Android? Is there a plan to enable it?
What is compared in the benchmarks? Is it comparing against calls to standard scalar implementations (e.g. libc)? When SLEEF is enabled, is it calling Neon or SVE routines?
I suppose only the high-accuracy routines are used, since they have the typical GNU ABI names?
Is there room for relaxing accuracy and using the more competitive 3.5 ULP routines (the standard accuracy in glibc libmvec)?
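
For readers unfamiliar with ULP-based accuracy bounds, the following minimal sketch shows how an error can be expressed in ULPs for double precision. `ulp_error` is a hypothetical helper for illustration only; it is not how SLEEF or glibc validate their routines, and it measures against a double-precision reference rather than the exact mathematical value.

```python
import math


def ulp_error(approx: float, reference: float) -> float:
    """Absolute error of `approx`, in units of the ULP of `reference` (double precision)."""
    return abs(approx - reference) / math.ulp(reference)


# A result that is off by exactly one representable double from the reference
# has an error of 1 ULP.
one_ulp_off = ulp_error(math.nextafter(1.0, 2.0), 1.0)
```

Under this measure, a "3.5 ULP" routine is one whose results stay within 3.5 such units of the reference across its input range.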

pytorch-bot bot added the module: inductor label Aug 13, 2024

pytorchbot added the open source label Aug 13, 2024

janeyx99 requested a review from EikanWang August 14, 2024 00:35

janeyx99 added the "triaged" label (This issue has been looked at a team member, and triaged and prioritized into an appropriate module) Aug 14, 2024

malfet approved these changes Aug 15, 2024


pytorch-bot bot added the "ciflow/linux-aarch64" label (linux aarch64 CI workflow) Aug 15, 2024

aditew01 requested review from IvanYashchuk, lezcano and nikitaved as code owners August 16, 2024 13:13

pytorch-bot bot added the "module: cpu" (CPU specific problem (e.g., perf, algorithm)) and "release notes: sparse" (release notes category) labels Aug 16, 2024

aditew01 force-pushed the aditew01/arm_sleef branch from 7712030 to 676c737 Compare August 16, 2024 13:22

pytorch-bot bot added the ciflow/inductor label Aug 16, 2024

pytorch-bot bot removed the ciflow/inductor label Aug 16, 2024

pytorch-bot bot added the "module: arm" label (Related to ARM architectures builds of PyTorch. Includes Apple M1) Aug 21, 2024

aditew01 and others added 4 commits September 18, 2024 17:44

o Refactor appending isa_list across different ARM platforms

1946800

Undo recent change

3dce9b9

Revert "o Enable torch build with SLEEF on ARM by default"

bce6630

This reverts commit 7ce6726.

o Refactor build to skip SLEEF on Android platform

c8c0c8b

aditew01 force-pushed the aditew01/arm_sleef branch from c653b9a to c8c0c8b Compare September 18, 2024 17:44

* fix linter error

4fe7a60

pytorchmergebot added the merging label Sep 20, 2024

malfet approved these changes Sep 20, 2024


pytorchmergebot added the Merged label Sep 20, 2024

pytorchmergebot closed this in 4842f0f Sep 20, 2024

pytorchmergebot removed the merging label Sep 20, 2024

malfet mentioned this pull request Sep 20, 2024

Fix Vectorized<double>::next_after SVE compilation #136388

Closed

zou3519 mentioned this pull request Sep 25, 2024

Revert a bunch of stuff #136668

Closed
