
Conversation

@Rohanjames1997 (Contributor) commented Jul 19, 2023

Fixes #104729

As suggested in the blog, I subclassed the `VecISA` class and implemented a NEON version of the `vec_reduce_all()` function, to go along with the existing AVX2 and AVX512 versions. Any operation that calls `vec_reduce_all()` will also take the NEON path and benefit from its vectorization.

`vec_reduce_all()` is invoked by Softmax and other operations such as norms. Using the fast path yields roughly 30% time savings for Softmax compared with the previously taken slow path.

|   | Slow path | Fast path (NEON intrinsics) |
| -- | -- | -- |
| Softmax (100 passes, 1024 dimension) | 623.706 ms | 452.011 ms |
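
For context, here is a minimal, self-contained sketch of a NEON horizontal reduction. The actual change implements the reduction through ATen's `Vectorized<float>` abstraction and takes an arbitrary reduction op, so `neon_reduce_sum` below is only an illustrative stand-in, not the PR's code:

```cpp
// Illustrative NEON horizontal sum over a float buffer (AArch64).
// Stand-in for the idea behind a NEON vec_reduce_all(); the real code
// lives in ATen's Vectorized<float> machinery.
#include <arm_neon.h>
#include <cstddef>

float neon_reduce_sum(const float* data, std::size_t n) {
  float32x4_t acc = vdupq_n_f32(0.0f);            // 4-lane accumulator
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    acc = vaddq_f32(acc, vld1q_f32(data + i));    // vectorized partial sums
  }
  float result = vaddvq_f32(acc);                 // horizontal add of the 4 lanes
  for (; i < n; ++i) {                            // scalar tail
    result += data[i];
  }
  return result;
}
```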

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang @Xia-Weiwen @ngimel @malfet

@pytorch-bot (bot) commented Jul 19, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/105590

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit e5628d0 with merge base 24d5cab:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla (bot) commented Jul 19, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

github-actions bot added the module: cpu, module: inductor, and ciflow/inductor labels Jul 19, 2023
@github-actions bot (Contributor) commented

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`.

If not, please add the `topic: not user facing` label.

To add a label, you can comment to pytorchbot, for example:
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@jgong5 (Collaborator) left a comment

Overall LGTM, will stamp after we make sure the changes are covered by UT and the PR passes the UT.

@zou3519 added the triaged label Jul 20, 2023
@Rohanjames1997 (Contributor, Author) commented

Thanks for the review, @jgong5.
I decided to extend the existing UT infrastructure to cover NEON as well. Since those changes also involve editing some critical CMake files, I've raised them in a separate PR; my code passes the vec_test_all_types_NEON UT.

Let me know if you'd like me to merge those changes into this PR, or keep them separate.
Thanks!

@jgong5 (Collaborator) commented Jul 24, 2023

> Let me know if you'd like me to merge those changes into this PR, or keep them separate.

How about making them separate and letting the other PR land first?

@Rohanjames1997 (Contributor, Author) commented

@jgong5, sure. I have raised the changes exclusively required to create the UT in this PR: #105823

@github-actions bot (Contributor) commented Oct 8, 2023

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@kit1980 (Contributor) commented Feb 23, 2024

@pytorchmergebot merge -i

@Rohanjames1997 please be careful when ignoring checks; you ignored a lint failure.

@ezyang (Contributor) commented Mar 7, 2024

@pytorchbot revert -m "#121288 (comment)"

@pytorch-bot (bot) commented Mar 7, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@ezyang (Contributor) commented Mar 7, 2024

@pytorchbot revert -m "#121288 (comment)" -c nosignal

@pytorchmergebot (Collaborator) commented

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator) commented

@Rohanjames1997 your PR has been successfully reverted.

pytorch-bot bot dismissed stale reviews from jgong5 and malfet March 7, 2024 23:06

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

@Rohanjames1997 requested review from jgong5 and malfet March 12, 2024 17:21
Rohanjames1997 added a commit to Rohanjames1997/pytorch that referenced this pull request Mar 12, 2024
…rch#105590)

Fixes pytorch#104729

As suggested in the [blog](https://dev-discuss.pytorch.org/t/torchinductor-update-5-cpu-backend-backend-performance-update-and-deep-dive-on-key-optimizations/1117#:~:text=It%20can%20be,sub%2Dclasses.), I subclassed the `VecISA` class and implemented a NEON version of the `vec_reduce_all()` function, to go along with the existing AVX2 and AVX512 versions. Any operation that calls `vec_reduce_all()` will also take the NEON path and benefit from its vectorization.

The `vec_reduce_all()` is invoked by Softmax and other operations like norms. Using the fast path results in 30% time savings for Softmax as compared to the previously taken slow path.

  | Slow path | Fast path (NEON intrinsics)
-- | -- | --
Softmax (100 passes, 1024 dimension) | 623.706ms | 452.011ms

Pull Request resolved: pytorch#105590
Approved by: https://github.com/jgong5, https://github.com/malfet
#include <c10/util/TypeCast.h>

-#if defined(CPU_CAPABILITY_AVX512) || defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_ZVECTOR)
+#if defined(CPU_CAPABILITY_AVX512) || defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_ZVECTOR) || defined(CPU_CAPABILITY_NEON)
Review comment (Collaborator):

For all `#elif defined(CPU_CAPABILITY_ZVECTOR)` in the masked_load functions in this file (e.g., https://github.com/pytorch/pytorch/pull/105590/files#diff-e384e0d2829ef483854a45c4f422979d1a2ca28495c7bc06e91eabc67c61a470R320), change them to `#else` so that the default path can work for NEON too.
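
For illustration, here is a hedged sketch of the guard restructuring being requested. The function below is a stand-in, not the real masked_load from this file; only the `#if`/`#elif`/`#else` shape matters:

```cpp
// Sketch of turning an ISA-specific `#elif` into a generic `#else`, so that
// NEON (and any other ISA without a specialized branch) takes the portable
// fallback. The function body is illustrative, not the PR's masked_load.
#include <cstring>

template <typename T>
void masked_load_sketch(T* dst, const T* src, int count) {
#if defined(CPU_CAPABILITY_AVX512)
  // AVX512-specific masked load would go here.
  std::memcpy(dst, src, sizeof(T) * count);
#elif defined(CPU_CAPABILITY_AVX2)
  // AVX2-specific masked load would go here.
  std::memcpy(dst, src, sizeof(T) * count);
#else
  // Default path (previously `#elif defined(CPU_CAPABILITY_ZVECTOR)`):
  // an element-wise copy that works for ZVECTOR, NEON, and scalar builds.
  for (int i = 0; i < count; ++i) {
    dst[i] = src[i];
  }
#endif
}
```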

malfet pushed a commit that referenced this pull request Mar 19, 2024
This is a re-land of #105590 but this time enabling it only for the Darwin platform, where those instructions are available by default
malfet pushed a commit that referenced this pull request Mar 22, 2024
This is a re-land of #105590 but this time enabling it only for the Darwin platform, where those instructions are available by default
malfet pushed a commit that referenced this pull request Mar 25, 2024
This is a re-land of #105590 but this time enabling it only for the Darwin platform, where those instructions are available by default
malfet added a commit that referenced this pull request Mar 25, 2024
This is a re-land of #105590 but this time enabling it only for the Darwin platform, where those instructions are available by default
pytorchmergebot pushed a commit that referenced this pull request Mar 26, 2024
This started as a re-land of #105590 focused on enabling it on macOS, but it quickly turned into landing very limited platform-specific acceleration at this time (i.e., this PR does not add any NEON-accelerated code at all; it just enables vectorized compilation for the existing abstractions).

Enabling the test harness uncovered a number of latent issues in CPU inductor that were fixed in the following PRs:
- #122511
- #122513
- #122580
- #122608

The following was added/changed to enable the vectorization code to work on macOS:
 - Added a VecNEON class to `_inductor/codecache.py` that is supported on all Apple Silicon Macs
 - Added `Vectorized::loadu_one_fourth` to `vec_base.h`, limited to 8-bit types (see the illustrative sketch after this commit message)
 - Changed the 64-bit integral type mapping to `int64_t`/`uint64_t` to align with the rest of the code, as on macOS `int64_t` is a `long long` rather than a `long` (see #118149 for more details)

See table below for perf changes with and without torch.compile using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on M2 Pro:
| dtype  | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16  | 120 tokens/sec  | 130 tokens/sec | 156 tokens/sec |
| float32  | 158 tokens/sec  | 140 tokens/sec | 236 tokens/sec |
| float16  | 235 tokens/sec  | 81 tokens/sec | 58 tokens/sec |

Pull Request resolved: #122217
Approved by: https://github.com/jansel
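
As a rough illustration of the `loadu_one_fourth` idea mentioned in the list above: the real method lives on ATen's `Vectorized` type and is restricted to 8-bit types, while the free function below uses an assumed 32-byte vector width to stay self-contained, so treat it as a sketch rather than the actual implementation.

```cpp
// Simplified stand-in for Vectorized<T>::loadu_one_fourth: load only the
// first quarter of a vector's worth of elements and zero-fill the rest.
// kVecBytes is an assumption for this sketch, not ATen's actual width.
#include <array>
#include <cstddef>
#include <cstring>

constexpr std::size_t kVecBytes = 32;

template <typename T>
std::array<T, kVecBytes / sizeof(T)> loadu_one_fourth_sketch(const void* ptr) {
  static_assert(sizeof(T) == 1, "limited to 8-bit types, as in the PR");
  std::array<T, kVecBytes / sizeof(T)> out{};       // zero-initialized lanes
  std::memcpy(out.data(), ptr, kVecBytes / 4);      // copy one fourth (8 bytes)
  return out;
}
```
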
@Rohanjames1997 (Contributor, Author) commented

Refer to #123584

pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024

Labels

ciflow/inductor, ciflow/trunk, Merged, module: cpu, module: inductor, open source, release notes: inductor, Reverted, triaged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for NEON ISA in the Inductor C++ backend

9 participants