Add support for NEON ISA in the Inductor C++ backend #104729

### 🚀 The feature, motivation and pitch

Context: The TorchInductor C++ backend currently vectorizes its C++ codegen for two x86 ISAs, AVX2 and AVX512, as described in the [Update 5 blog post](https://dev-discuss.pytorch.org/t/torchinductor-update-5-cpu-backend-backend-performance-update-and-deep-dive-on-key-optimizations/1117#vectorization-in-c-codegen-4). The ATen library also supports Arm, but Inductor does not yet use Arm's NEON/SVE ISAs to generate optimized kernels. The blog post also [notes](https://dev-discuss.pytorch.org/t/torchinductor-update-5-cpu-backend-backend-performance-update-and-deep-dive-on-key-optimizations/1117#vectorization-in-c-codegen-4:~:text=It%20can%20be,sub%2Dclasses.) that the `VecISA` class can be subclassed to support other ISAs.

Proposal: I am working on NEON ISA support for TorchInductor's C++ backend. In particular, I intend to provide a NEON implementation of the `vec_reduce_all()` function. It currently has optimized [AVX2 and AVX512 intrinsics implementations](https://github.com/pytorch/pytorch/blob/ced5c89b6fbe827a538b7ada96b2f9a5989871c7/aten/src/ATen/cpu/vec/functional_base.h#L37-L79) for x86 processors, introduced by @mingfeima in #73953, and a [slow path](https://github.com/pytorch/pytorch/blob/ced5c89b6fbe827a538b7ada96b2f9a5989871c7/aten/src/ATen/cpu/vec/functional_base.h#L12-L28) implementation for other processors, including Arm. I have implemented a NEON version of the function and wired up Inductor's generated C++ to invoke this NEON path on Arm CPUs, and I have seen performance improvements, particularly in the Softmax operation.
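To make the idea concrete, here is a rough, standalone sketch of the kind of horizontal reduction a NEON `vec_reduce_all()` path could perform. It is not the ATen code and not the implementation I plan to submit: `reduce_max4` and `reduce_sum4` are hypothetical names, the functions take a plain `float[4]` instead of `Vectorized<float>`, and the non-Arm branch is only there so the snippet compiles everywhere. Max and sum are used because those are the two reductions Softmax needs.

```cpp
#include <algorithm>
#include <cassert>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not part of ATen): max over four float lanes.
inline float reduce_max4(const float* lanes) {
  assert(lanes != nullptr);
#if defined(__aarch64__)
  float32x4_t v = vld1q_f32(lanes);      // v = [a, b, c, d]
  // Combine with the vector rotated by two lanes: [c, d, a, b].
  v = vmaxq_f32(v, vextq_f32(v, v, 2));  // lanes 0,1 hold max(a,c), max(b,d)
  // Combine with neighbours swapped inside each 64-bit half.
  v = vmaxq_f32(v, vrev64q_f32(v));      // every lane holds the full max
  return vgetq_lane_f32(v, 0);
#else
  // Scalar fallback so the sketch also builds on non-Arm hosts.
  float acc = lanes[0];
  for (int i = 1; i < 4; ++i) acc = std::max(acc, lanes[i]);
  return acc;
#endif
}

// Hypothetical helper (not part of ATen): sum over four float lanes,
// using the same two shuffle-and-combine steps as reduce_max4.
inline float reduce_sum4(const float* lanes) {
  assert(lanes != nullptr);
#if defined(__aarch64__)
  float32x4_t v = vld1q_f32(lanes);
  v = vaddq_f32(v, vextq_f32(v, v, 2));
  v = vaddq_f32(v, vrev64q_f32(v));
  return vgetq_lane_f32(v, 0);
#else
  float acc = lanes[0];
  for (int i = 1; i < 4; ++i) acc += lanes[i];
  return acc;
#endif
}
```

The shuffle-and-combine pattern is shown because it works for any associative binary op, which matches how `vec_reduce_all()` takes the op as a parameter. Note that the NEON sum adds lanes in a different order than the scalar loop, so floating-point results can differ in the last bits for general inputs.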

Posting this here for any discussion before raising a PR.

### Alternatives

_No response_

### Additional context

_No response_

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @Xia-Weiwen @ngimel
