[Inductor] Add support for NEON ISA in the Inductor C++ backend #105590
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/105590
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure. As of commit e5628d0 with merge base 24d5cab, the following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a `release notes:` label.
jgong5 left a comment:
Overall LGTM; will stamp after we make sure the changes are covered by UT and the PR passes the UT.
Thanks for the review, @jgong5. Let me know if you'd like me to merge those changes into this PR, or keep them separate.
How about making them separate and letting the other PR land first?
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as `Stale`.
@Rohanjames1997 please be careful with ignoring; you ignored a lint failure.
Pull Request resolved: #120461
Approved by: https://github.com/Skylion007
@pytorchbot revert -m "#121288 (comment)"
❌ 🤖 pytorchbot command failed. Try `@pytorchbot --help` for more info.
@pytorchbot revert -m "#121288 (comment)" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
Revert "[Inductor] Add support for NEON ISA in the Inductor C++ backend (#105590)"

This reverts commit 156954d. Reverted #105590 on behalf of https://github.com/ezyang due to #121288 (comment) ([comment](#105590 (comment)))
@Rohanjames1997 your PR has been successfully reverted. |
This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.
[Inductor] Add support for NEON ISA in the Inductor C++ backend (pytorch#105590)

Fixes pytorch#104729

As suggested in the [blog](https://dev-discuss.pytorch.org/t/torchinductor-update-5-cpu-backend-backend-performance-update-and-deep-dive-on-key-optimizations/1117#:~:text=It%20can%20be,sub%2Dclasses.), I subclassed the `VecISA` class and implemented a NEON version of the `vec_reduce_all()` function, to go along with the existing AVX2 and AVX512 versions. Any operation that calls `vec_reduce_all()` will also take the NEON path and benefit from its vectorization.

`vec_reduce_all()` is invoked by Softmax and other operations like norms. Using the fast path results in roughly 30% time savings for Softmax compared to the previously taken slow path.

| | Slow path | Fast path (NEON intrinsics) |
| -- | -- | -- |
| Softmax (100 passes, 1024 dimension) | 623.706 ms | 452.011 ms |

Pull Request resolved: pytorch#105590
Approved by: https://github.com/jgong5, https://github.com/malfet
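For context, here is a minimal sketch of the kind of AArch64 NEON horizontal reduction such a fast path boils down to. This is not the PR's actual code: the function name and loop structure are illustrative, and only the intrinsics (`vld1q_f32`, `vaddq_f32`, `vaddvq_f32`) are real NEON API.

```cpp
#include <arm_neon.h>

// Illustrative NEON horizontal sum: accumulate four float lanes at a
// time, then collapse the register with an across-lane add.
float reduce_sum_neon(const float* data, int n) {
  float32x4_t acc = vdupq_n_f32(0.0f);
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    acc = vaddq_f32(acc, vld1q_f32(data + i));  // vector accumulate
  }
  float sum = vaddvq_f32(acc);  // horizontal add across the four lanes
  for (; i < n; ++i) {
    sum += data[i];  // scalar tail for leftover elements
  }
  return sum;
}
```

The across-lane `vaddvq_f32` is what replaces the element-by-element slow path, which is the kind of saving shown in the table above.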
```diff
 #include <c10/util/TypeCast.h>

-#if defined(CPU_CAPABILITY_AVX512) || defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_ZVECTOR)
+#if defined(CPU_CAPABILITY_AVX512) || defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_ZVECTOR) || defined(CPU_CAPABILITY_NEON)
```
For all `#elif defined(CPU_CAPABILITY_ZVECTOR)` in the masked_load functions in this file (e.g., https://github.com/pytorch/pytorch/pull/105590/files#diff-e384e0d2829ef483854a45c4f422979d1a2ca28495c7bc06e91eabc67c61a470R320), change them to `#else` so that the default path can work for NEON too.
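A hedged sketch of the structure this comment asks for; the capability macros are the real PyTorch flags, but the function and its string returns are placeholders standing in for the masked_load bodies:

```cpp
// Placeholder illustrating the #elif -> #else restructuring; the real
// masked_load implementations live in the vec headers touched by this PR.
inline const char* masked_load_path() {
#if defined(CPU_CAPABILITY_AVX512)
  return "avx512";  // AVX-512 specific masked load
#elif defined(CPU_CAPABILITY_AVX2)
  return "avx2";    // AVX2 specific masked load
#else
  // Previously `#elif defined(CPU_CAPABILITY_ZVECTOR)`; as a plain #else
  // default it is now also taken when CPU_CAPABILITY_NEON is defined.
  return "default";
#endif
}
```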
This is a re-land of #105590, but this time enabling it only for the Darwin platform, where those instructions are available by default.
This started as a re-land of #105590 focused on enabling it on macOS, but it quickly turned into landing very limited platform-specific acceleration (i.e., this PR does not add any NEON-accelerated code at all; it just enables vectorized compilation for the existing abstractions). Enabling the test harness uncovered a number of latent issues in the CPU inductor that were fixed in the following PRs:
- #122511
- #122513
- #122580
- #122608

The following was added/changed to enable vectorization code to work on macOS:
- Added a `VecNEON` class to `_inductor/codecache.py` that is supported on all Apple Silicon Macs
- Added `Vectorized::loadu_one_fourth` to `vec_base.h`, limited to 8-bit types (see the sketch after this description)
- Changed the 64-bit integral type mappings to `int64_t`/`uint64_t` to align with the rest of the code, as on macOS `int64_t` is a `long long` rather than a `long` (see #118149 for more details)

See the table below for perf changes with and without torch.compile, using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) to run `stories15M` on an M2 Pro:

| dtype | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16 | 120 tokens/sec | 130 tokens/sec | 156 tokens/sec |
| float32 | 158 tokens/sec | 140 tokens/sec | 236 tokens/sec |
| float16 | 235 tokens/sec | 81 tokens/sec | 58 tokens/sec |

Pull Request resolved: #122217
Approved by: https://github.com/jansel
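A minimal self-contained sketch of the `loadu_one_fourth` shape described above. The `Vec` stand-in, its 32-byte register width, and the zero-filling partial `loadu` are assumptions for illustration; the real class is `Vectorized` in `vec_base.h`:

```cpp
#include <array>
#include <cstring>
#include <type_traits>

// Stand-in for Vectorized<T>, assuming a 32-byte register for illustration.
template <typename T>
struct Vec {
  static constexpr int size() { return static_cast<int>(32 / sizeof(T)); }
  std::array<T, 32 / sizeof(T)> values{};  // lanes start zero-initialized

  // Partial unaligned load: copy `count` elements, leave the rest zeroed.
  static Vec loadu(const void* ptr, int count) {
    Vec v;
    std::memcpy(v.values.data(), ptr, count * sizeof(T));
    return v;
  }

  // Load one quarter of the lane count, restricted to 8-bit element types.
  static Vec loadu_one_fourth(const void* ptr) {
    static_assert(
        std::is_same_v<T, signed char> || std::is_same_v<T, unsigned char>,
        "loadu_one_fourth is limited to 8-bit types");
    return loadu(ptr, size() / 4);
  }
};
```

Under the assumed 32-byte width, `Vec<unsigned char>` has 32 lanes, so `loadu_one_fourth` reads 8 bytes; restricting the helper to 8-bit types is what keeps the quarter-load meaningful.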
Refer to #123584
Fixes #104729
As suggested in the blog, I subclassed the `VecISA` class and implemented a NEON version of the `vec_reduce_all()` function, to go along with the existing AVX2 and AVX512 versions. Any operation that calls `vec_reduce_all()` will also take the NEON path and benefit from its vectorization.

`vec_reduce_all()` is invoked by Softmax and other operations like norms. Using the fast path results in roughly 30% time savings for Softmax compared to the previously taken slow path.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang @Xia-Weiwen @ngimel @malfet
vec_reduce_all()is invoked by Softmax and other operations like norms. Using the fast path results in 30% time savings for Softmax as compared to the previously taken slow path.cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang @Xia-Weiwen @ngimel @malfet