Description
🚀 The feature, motivation and pitch
Context: The TorchInductor C++ backend currently supports vectorized C++ codegen through two Intel ISAs, AVX2 and AVX512, as mentioned in the Update 5 Blog. Although the ATen library also supports Arm, Inductor does not yet leverage Arm's NEON/SVE ISAs to generate optimized kernels. The blog also mentions that the VecISA class can be subclassed to support other ISAs.
Proposal: I am working on NEON ISA support for TorchInductor's C++ backend. In particular, I intend to provide a NEON implementation of the vec_reduce_all() function. It currently has optimized AVX2 and AVX512 intrinsics implementations for x86 processors, introduced by @mingfeima in #73953, and a slow-path implementation for all other processors, including Arm. I have implemented a NEON version of the function and wired up Inductor's generated C++ to invoke this NEON path on Arm CPUs. I have seen performance improvements, particularly in the Softmax operation.
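To illustrate the kind of reduction involved, here is a minimal sketch of a sum and a max reduction over a float array, with a NEON path on AArch64 and a scalar fallback elsewhere. It is not the ATen `vec_reduce_all()` code and not the implementation proposed here: the names `reduce_sum`, `reduce_max` and their `_scalar`/`_neon` variants are hypothetical, and the sketch assumes `float` data and an AArch64 target for the across-lane intrinsics `vaddvq_f32` and `vmaxvq_f32`.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helpers for illustration only; none of these are ATen APIs.

// Scalar fallback: one element at a time.
inline float reduce_sum_scalar(const float* data, std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i) acc += data[i];
  return acc;
}

inline float reduce_max_scalar(const float* data, std::size_t n) {
  assert(n > 0);  // max of an empty range is undefined in this sketch
  float acc = data[0];
  for (std::size_t i = 1; i < n; ++i) acc = data[i] > acc ? data[i] : acc;
  return acc;
}

#if defined(__aarch64__)
// NEON path: accumulate four lanes at a time, then reduce across lanes.
inline float reduce_sum_neon(const float* data, std::size_t n) {
  float32x4_t acc = vdupq_n_f32(0.0f);
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) acc = vaddq_f32(acc, vld1q_f32(data + i));
  float result = vaddvq_f32(acc);        // horizontal add of the 4 lanes
  for (; i < n; ++i) result += data[i];  // scalar tail
  return result;
}

inline float reduce_max_neon(const float* data, std::size_t n) {
  assert(n > 0);
  float32x4_t acc = vdupq_n_f32(data[0]);
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) acc = vmaxq_f32(acc, vld1q_f32(data + i));
  float result = vmaxvq_f32(acc);        // horizontal max of the 4 lanes
  for (; i < n; ++i) result = data[i] > result ? data[i] : result;
  return result;
}
#endif

// Dispatch at compile time, in the spirit of an ISA-specific fast path.
inline float reduce_sum(const float* data, std::size_t n) {
#if defined(__aarch64__)
  return reduce_sum_neon(data, n);
#else
  return reduce_sum_scalar(data, n);
#endif
}

inline float reduce_max(const float* data, std::size_t n) {
#if defined(__aarch64__)
  return reduce_max_neon(data, n);
#else
  return reduce_max_scalar(data, n);
#endif
}
```

Softmax needs both reductions per row (a max for numerical stability, then a sum of exponentials), which is consistent with it benefiting from a faster cross-lane reduction. Note that the NEON sum adds elements in a different order than the scalar loop, so results can differ in the last bits for general floating-point inputs.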
Posting this here for any discussion before raising a PR.
Alternatives
No response
Additional context
No response
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @Xia-Weiwen @ngimel