[PyTorch] Migrate fp16 gemv fast path kernel from intrinsics to vec::Vectorized #137912
…Vectorized Migrated as much as possible and convenient; focusing on fp16 for now. (This is building toward enabling these fast paths on x86 for machines without AVX-512fp16/bf16 to fix pytorch/torchchat#1253 .) Differential Revision: [D64218206](https://our.internmc.facebook.com/intern/diff/D64218206/) [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137912
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit c8f47af with merge base b9618c9.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D64218206
…el from intrinsics to vec::Vectorized" Very similar to #137912, but for bf16. (This is building toward enabling this fast path on non-ARM architectures, and in particular on x86 for machines without AVX512BF16). Differential Revision: [D65120325](https://our.internmc.facebook.com/intern/diff/D65120325/) [ghstack-poisoned]
…137913) float16_t is ARM-specific. Half is not. Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/) Pull Request resolved: #137913 Approved by: https://github.com/Skylion007, https://github.com/malfet ghstack dependencies: #137661, #137911, #137912
…pu/ (#137914) This is in preparation for supporting x86 as well; we need to be in this directory so that we can get rebuilt with different CPU_CAPABILITY settings (AVX2/AVX-512). Also incidentally starts fulfilling request from @malfet to split the ARM64 fast path stuff into its own file. BFloat16 will be in a later diff. Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/) Pull Request resolved: #137914 Approved by: https://github.com/Skylion007, https://github.com/malfet ghstack dependencies: #137661, #137911, #137912, #137913
In preparation for other vector instruction sets. (NEON and AVX512 have 32 registers, but AVX and AVX2 have only 16.) Differential Revision: [D64265759](https://our.internmc.facebook.com/intern/diff/D64265759/) Pull Request resolved: #137915 Approved by: https://github.com/Skylion007, https://github.com/malfet ghstack dependencies: #137661, #137911, #137912, #137913, #137914
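The register-count difference noted above could be captured as a compile-time constant, roughly as sketched here. The names `kVectorRegisterCount` and `kAccumulatorCount` are hypothetical and not taken from the PR. The preprocessor macros are the usual compiler-defined ones. Using half of the registers as accumulators is an illustrative assumption only, not a documented tuning choice.

```cpp
#include <cstddef>

// Hypothetical constant: number of architectural SIMD registers for the
// instruction set this translation unit is being compiled for.
#if defined(__aarch64__) || defined(__AVX512F__)
constexpr std::size_t kVectorRegisterCount = 32;  // AArch64 NEON, AVX-512
#else
constexpr std::size_t kVectorRegisterCount = 16;  // AVX/AVX2, conservative default
#endif

// Illustrative assumption: keep half of the registers as accumulators and
// leave the rest for loaded operands and temporaries.
constexpr std::size_t kAccumulatorCount = kVectorRegisterCount / 2;

static_assert(kAccumulatorCount == 8 || kAccumulatorCount == 16,
              "accumulator count follows the register count");
```

Deriving the unroll factor from one constant means a kernel written this way would not need separate edits per instruction set.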
…whole vector register instead of half (#137916) The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler. Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/) Pull Request resolved: #137916 Approved by: https://github.com/malfet ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915
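The idea of consuming whole vector registers and leaving the remainder to a scalar loop can be sketched as follows. This is a minimal illustration, not the PR's code: `dot_with_scalar_tail` and `kLanes` are hypothetical names, `float` stands in for fp16 so the example compiles anywhere, and the inner lane loop stands in for what a single vector instruction would do. With 8 lanes (a 128-bit register of fp16 values) the tail is at most 7 elements.

```cpp
#include <cstddef>

// Hypothetical lane count: one vector register's worth of elements.
constexpr std::size_t kLanes = 8;

// Dot product whose main loop only ever touches whole kLanes-wide chunks.
// Leftover elements (at most kLanes - 1) are handled by a plain scalar loop.
float dot_with_scalar_tail(const float* a, const float* b, std::size_t n) {
  float lane_sums[kLanes] = {};  // per-lane partial sums, zero-initialized
  const std::size_t vec_end = n - (n % kLanes);

  // Main loop: one full "register" per iteration.
  for (std::size_t i = 0; i < vec_end; i += kLanes) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      lane_sums[lane] += a[i + lane] * b[i + lane];
    }
  }

  // Horizontal reduction of the per-lane partial sums.
  float sum = 0.0f;
  for (std::size_t lane = 0; lane < kLanes; ++lane) {
    sum += lane_sums[lane];
  }

  // Scalar fixup loop for the tail; no half-width vector path is needed.
  for (std::size_t i = vec_end; i < n; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

Because the tail never uses a half-width vector, the same structure carries over to instruction sets whose wrapper exposes only one register width.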
…s for non-ARM architectures too (#137917) Remove reasons to gate it on ARM. Differential Revision: [D64280687](https://our.internmc.facebook.com/intern/diff/D64280687/) Pull Request resolved: #137917 Approved by: https://github.com/malfet ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915, #137916
…el from intrinsics to vec::Vectorized" Very similar to #137912, but for bf16. (This is building toward enabling this fast path on non-ARM architectures, and in particular on x86 for machines without AVX512BF16). Differential Revision: [D65120325](https://our.internmc.facebook.com/intern/diff/D65120325/) cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…el from intrinsics to vec::Vectorized" Very similar to #137912, but for bf16. (This is building toward enabling this fast path on non-ARM architectures, and in particular on x86 for machines without AVX512BF16). Testing: checked for regression with llm_experiments' benchmarks/benchmark_torch_mm.py llm on M1 Mac and it appeared to be neutral. Supported this assessment by inspecting assembly for the bf16 dot kernel (`objdump -d --no-leading-addr --no-show-raw-insn build/caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.DEFAULT.cpp.o | c++filt` from pytorch root directory after `python setup.py develop`); observed minor instruction scheduling changes but nothing more. Differential Revision: [D65120325](https://our.internmc.facebook.com/intern/diff/D65120325/) cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
…Vectorized (pytorch#137912) Migrated as much as possible and convenient; focusing on fp16 for now. (This is building toward enabling these fast paths on x86 for machines without AVX-512fp16/bf16 to fix pytorch/torchchat#1253 .) Differential Revision: [D64218206](https://our.internmc.facebook.com/intern/diff/D64218206/) Pull Request resolved: pytorch#137912 Approved by: https://github.com/malfet ghstack dependencies: pytorch#137661, pytorch#137911
Stack from ghstack (oldest at bottom):
Migrated as much of the gemv fast path kernel from intrinsics to
vec::Vectorized as was possible and convenient, focusing on fp16 for now.
(This builds toward enabling these fast paths on x86 machines without
AVX-512 FP16/BF16 support, to fix pytorch/torchchat#1253.)
Differential Revision: D64218206
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
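To illustrate what writing a gemv kernel against a vector wrapper rather than raw intrinsics might look like, here is a self-contained sketch. `PortableVec` is a hypothetical stand-in loosely modelled on the general shape of a wrapper such as vec::Vectorized; it is not the PyTorch class and its members are assumptions made for this example. `gemv_rows` is likewise a hypothetical name, and `float` is used in place of fp16 so the code builds without any library.

```cpp
#include <array>
#include <cstddef>

// Hypothetical portable vector wrapper: N lanes of T backed by a plain array.
// A real backend would map these operations onto NEON or AVX instructions.
template <typename T, std::size_t N>
struct PortableVec {
  std::array<T, N> lanes{};  // zero-initialized by default

  static constexpr std::size_t size() { return N; }

  // Unaligned load of N consecutive elements.
  static PortableVec loadu(const T* ptr) {
    PortableVec v;
    for (std::size_t i = 0; i < N; ++i) v.lanes[i] = ptr[i];
    return v;
  }

  // Lane-wise a * b + acc.
  friend PortableVec fmadd(const PortableVec& a, const PortableVec& b,
                           const PortableVec& acc) {
    PortableVec r;
    for (std::size_t i = 0; i < N; ++i) {
      r.lanes[i] = a.lanes[i] * b.lanes[i] + acc.lanes[i];
    }
    return r;
  }

  // Horizontal sum of all lanes.
  T reduce_add() const {
    T s = T(0);
    for (std::size_t i = 0; i < N; ++i) s += lanes[i];
    return s;
  }
};

// y[row] = dot(A[row, :], x) for a row-major m x n matrix A.
// The kernel names no instruction set: only the Vec type does.
template <typename Vec>
void gemv_rows(const float* A, const float* x, float* y,
               std::size_t m, std::size_t n) {
  const std::size_t vec_end = n - (n % Vec::size());
  for (std::size_t row = 0; row < m; ++row) {
    const float* a_row = A + row * n;
    Vec acc;  // all lanes start at zero
    for (std::size_t j = 0; j < vec_end; j += Vec::size()) {
      acc = fmadd(Vec::loadu(a_row + j), Vec::loadu(x + j), acc);
    }
    float sum = acc.reduce_add();
    for (std::size_t j = vec_end; j < n; ++j) {
      sum += a_row[j] * x[j];  // scalar tail
    }
    y[row] = sum;
  }
}
```

A call such as `gemv_rows<PortableVec<float, 8>>(A, x, y, m, n)` selects the lane width through the template argument, which is the property that lets one kernel source be rebuilt for different CPU capabilities.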