[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert #137661

Closed

swolchok wants to merge 15 commits into gh/swolchok/651/base from gh/swolchok/651/head

Contributor

swolchok commented Oct 9, 2024

Stack from ghstack (oldest at bottom):

NEON vectors are 128-bit and don't belong with 256 stuff.

Differential Revision: D64143615

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
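
For readers unfamiliar with the width mismatch the description refers to, here is a minimal sketch of the lane arithmetic. The `lanes` helper is hypothetical (it is not a PyTorch API), and the sketch assumes a 32-bit `float`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper for illustration only (not part of ATen):
// number of elements of type T that fit in a SIMD register of the
// given width in bits.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
  return register_bits / (8 * sizeof(T));
}

// A NEON register is 128 bits wide: 4 floats or 8 16-bit values.
static_assert(lanes<float>(128) == 4, "128-bit register holds 4 floats");
static_assert(lanes<std::uint16_t>(128) == 8, "128-bit register holds 8 x 16-bit");

// A 256-bit register (for example AVX2) holds twice as many.
static_assert(lanes<float>(256) == 8, "256-bit register holds 8 floats");
```

Code written for 256-bit vectors assumes the larger lane counts, which is one reason a 128-bit specialization sits more naturally next to other 128-bit code.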


[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert

35de4dd

NEON vectors are 128-bit and don't belong with 256 stuff.

Differential Revision: [D64143615](https://our.internmc.facebook.com/intern/diff/D64143615/)

[ghstack-poisoned]

pytorch-bot bot commented Oct 9, 2024

✅ No Failures

As of commit 59a2c71 with merge base b9618c9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the module: cpu label

swolchok mentioned this pull request

[PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel, Try #2 #137377

Closed

Contributor

facebook-github-bot commented Oct 9, 2024

This pull request was exported from Phabricator. Differential Revision: D64143615

swolchok mentioned this pull request

[PyTorch] Use 128-bit vectors for ARM64 #137426

Closed

facebook-github-bot added the fb-exported label

swolchok added a commit that referenced this pull request


[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert

b86f2d8

NEON vectors are 128-bit and don't belong with 256 stuff.

Differential Revision: [D64143615](https://our.internmc.facebook.com/intern/diff/D64143615/)

ghstack-source-id: 247178186
Pull Request resolved: #137661


Update on "[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert"

f4952ec

NEON vectors are 128-bit and don't belong with 256 stuff.

Differential Revision: [D64143615](https://our.internmc.facebook.com/intern/diff/D64143615/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

Contributor

facebook-github-bot commented Oct 10, 2024

This pull request was exported from Phabricator. Differential Revision: D64143615

This was referenced Oct 10, 2024

[PyTorch] Check defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256) instead of defined(CPU_CAPABILITY_NEON) #137722

Closed

[PyTorch] add NEON half2float fmadd/fmsub #137723

Closed


Update on "[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert"

8e1773c

NEON vectors are 128-bit and don't belong with 256 stuff.

Differential Revision: [D64143615](https://our.internmc.facebook.com/intern/diff/D64143615/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

Contributor

facebook-github-bot commented Oct 10, 2024

This pull request was exported from Phabricator. Differential Revision: D64143615


Update on "[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert"

40a64df

NEON vectors are 128-bit and don't belong with 256 stuff.

Differential Revision: [D64143615](https://our.internmc.facebook.com/intern/diff/D64143615/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

Contributor

facebook-github-bot commented Oct 10, 2024

This pull request was exported from Phabricator. Differential Revision: D64143615


unbreak build on "[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert"

684af12

NEON vectors are 128-bit and don't belong with 256 stuff.

Differential Revision: [D64143615](https://our.internmc.facebook.com/intern/diff/D64143615/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

Contributor

facebook-github-bot commented Oct 10, 2024

This pull request was exported from Phabricator. Differential Revision: D64143615

swolchok added a commit that referenced this pull request


[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert

f270dd5

Pull Request resolved: #137661

NEON vectors are 128-bit and don't belong with 256 stuff.
ghstack-source-id: 247393002
@exported-using-ghexport

Differential Revision: [D64143615](https://our.internmc.facebook.com/intern/diff/D64143615/)

swolchok added the topic: not user facing label

swolchok requested review from jgong5, kimishpatel and malfet and removed request for kimishpatel and malfet

October 11, 2024 01:25

jgong5 approved these changes

pytorch-bot bot added the ciflow/trunk label


Update on "[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert"

385464b

NEON vectors are 128-bit and don't belong with 256 stuff.

Differential Revision: [D64143615](https://our.internmc.facebook.com/intern/diff/D64143615/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

Contributor

facebook-github-bot commented Oct 14, 2024

This pull request was exported from Phabricator. Differential Revision: D64143615

Contributor

facebook-github-bot commented Oct 28, 2024

This pull request was exported from Phabricator. Differential Revision: D64143615

This was referenced Oct 28, 2024

Move bf16_gemv_trans to ReducedPrecisionFloatGemvFastPathKernel #139081

Closed

[PyTorch] Add efficient isnan for NEON float #139082

Closed

[PyTorch] Add efficient isnan for NEON half #139083

Closed

Extract value_type-generic NEON Vectorized<Half> functions to CRTP base class #139084

Closed

Add Vectorized<c10::BFloat16> specialization for ARM #139090

Closed

[PyTorch] Migrate bf16 gemv fast path kernel from intrinsics to vec::Vectorized #139159

Closed


Update on "[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert"

59a2c71

NEON vectors are 128-bit and don't belong with 256 stuff.

Differential Revision: [D64143615](https://our.internmc.facebook.com/intern/diff/D64143615/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

Contributor

facebook-github-bot commented Oct 29, 2024

This pull request was exported from Phabricator. Differential Revision: D64143615

This was referenced Oct 29, 2024

Build bf16 gemv fast path & entry points for non-ARM architectures too #139208

Closed

Hook up bf16_gemv_trans to x86 bf16 GEMM #139220

Closed

pytorchmergebot closed this in 837538f

pytorchmergebot added the Merged label

pytorchmergebot pushed a commit that referenced this pull request


[PyTorch] Specialize Vectorized<Half> for NEON even if FP16 arithmetic isn't available (#137911)

41d7471

We can do most of what this header does (by line count) anyway by converting to and from float.

Differential Revision: [D64265757](https://our.internmc.facebook.com/intern/diff/D64265757/)

Pull Request resolved: #137911
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #137661

pytorchmergebot pushed a commit that referenced this pull request


[PyTorch] Migrate fp16 gemv fast path kernel from intrinsics to vec::Vectorized (#137912)

9ede4b2

Migrated as much as possible and convenient; focusing on fp16
for now. (This is building toward enabling these fast paths on x86 for
machines without AVX-512fp16/bf16 to fix
pytorch/torchchat#1253 .)

Differential Revision: [D64218206](https://our.internmc.facebook.com/intern/diff/D64218206/)

Pull Request resolved: #137912
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911

pytorchmergebot pushed a commit that referenced this pull request


[PyTorch] Use Half, not float16_t, in fp16 gemv fast path signatures (#137913)

6502d6c

float16_t is ARM-specific. Half is not.

Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/)

Pull Request resolved: #137913
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912

pytorchmergebot pushed a commit that referenced this pull request


[PyTorch] Move FP16 dot and GEMV kernels to new file in ATen/native/cpu/ (#137914)

aafbea4

This is in preparation for supporting x86 as well; we need to
be in this directory so that we can get rebuilt with different
CPU_CAPABILITY settings (AVX2/AVX-512). Also incidentally starts
fulfilling request from @malfet to split the ARM64 fast path stuff
into its own file. BFloat16 will be in a later diff.

Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/)

Pull Request resolved: #137914
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913

pytorchmergebot pushed a commit that referenced this pull request


[PyTorch] Clean up Registers/ElementsPerIteration constants (#137915)

5be1556

In preparation for other vector instruction sets. (NEON and AVX512 have 32 registers, but AVX and AVX2 have only 16.)

Differential Revision: [D64265759](https://our.internmc.facebook.com/intern/diff/D64265759/)

Pull Request resolved: #137915
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914
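
The register counts cited in the #137915 commit message could be expressed as compile-time constants along the following lines. This is only a sketch: the constant names, the choice to use half of the register file for accumulators, and the fallback value for other architectures are assumptions made for illustration, not the constants from the PR.

```cpp
#include <cstddef>

// Hypothetical constants; names are illustrative, not taken from the PR.
#if defined(__aarch64__) || defined(__AVX512F__)
// AArch64 NEON and AVX-512 both expose 32 vector registers.
constexpr std::size_t kVectorRegisters = 32;
#else
// AVX and AVX2 (in 64-bit mode) expose 16 vector registers.
// Assumption: the same value is used as a conservative fallback elsewhere.
constexpr std::size_t kVectorRegisters = 16;
#endif

// Assumed tuning choice: use half of the registers as accumulators and
// leave the rest for loaded operands and temporaries.
constexpr std::size_t kAccumulatorRegisters = kVectorRegisters / 2;

// Assumed lane count: eight 16-bit elements per 128-bit register.
constexpr std::size_t kLanesPerRegister = 8;

// Elements consumed by one fully unrolled loop iteration.
constexpr std::size_t kElementsPerUnrolledIteration =
    kAccumulatorRegisters * kLanesPerRegister;

static_assert(kAccumulatorRegisters > 0, "need at least one accumulator");
```

Deriving the unroll factor from the register count means a port to another instruction set only has to change one constant.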

pytorchmergebot pushed a commit that referenced this pull request


[PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half (#137916)

fc2d0da

The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler.

Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/)

Pull Request resolved: #137916
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915
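
The loop shape described in the #137916 commit message (whole-register blocks, then a scalar loop for the leftover elements) can be sketched in portable C++ as follows. The function name and `kLanes` are hypothetical, plain `float` stands in for the reduced-precision element type, and a small array stands in for a vector register, so this illustrates the structure rather than the actual kernel.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical lane count: eight elements per whole vector register.
constexpr std::size_t kLanes = 8;

// Dot product with a blocked main loop and a scalar tail.
float dot_with_scalar_tail(const float* a, const float* b, std::size_t n) {
  assert(n == 0 || (a != nullptr && b != nullptr));

  float acc[kLanes] = {};  // stand-in for one vector accumulator
  std::size_t i = 0;

  // Main loop: process whole blocks of kLanes elements.
  for (; i + kLanes <= n; i += kLanes) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      acc[lane] += a[i + lane] * b[i + lane];
    }
  }

  // Horizontal reduction of the accumulator lanes.
  float sum = 0.0f;
  for (std::size_t lane = 0; lane < kLanes; ++lane) {
    sum += acc[lane];
  }

  // Tail: at most kLanes - 1 (here 7) elements remain; handle them one
  // at a time instead of with a narrower vector.
  for (; i < n; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

Because the tail never touches a partial register, the same structure carries over to instruction sets with different register widths by changing only the lane count.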

pytorchmergebot pushed a commit that referenced this pull request


[PyTorch] Build ReducedPrecisionFloatGemvFastPathKernel & entry points for non-ARM architectures too (#137917)

b29c170

Remove reasons to gate it on ARM.

Differential Revision: [D64280687](https://our.internmc.facebook.com/intern/diff/D64280687/)

Pull Request resolved: #137917
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915, #137916

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request


[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert (pytorch#137661)

51e5d02

NEON vectors are 128-bit and don't belong with 256 stuff.

Differential Revision: [D64143615](https://our.internmc.facebook.com/intern/diff/D64143615/)

Pull Request resolved: pytorch#137661
Approved by: https://github.com/jgong5, https://github.com/malfet

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request


[PyTorch] Specialize Vectorized<Half> for NEON even if FP16 arithmetic isn't available (pytorch#137911)

We can do most of what this header does (by line count) anyway by converting to and from float.

Differential Revision: [D64265757](https://our.internmc.facebook.com/intern/diff/D64265757/)

Pull Request resolved: pytorch#137911
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: pytorch#137661

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request


[PyTorch] Migrate fp16 gemv fast path kernel from intrinsics to vec::Vectorized (pytorch#137912)

607a5ab

Migrated as much as possible and convenient; focusing on fp16
for now. (This is building toward enabling these fast paths on x86 for
machines without AVX-512fp16/bf16 to fix
pytorch/torchchat#1253 .)

Differential Revision: [D64218206](https://our.internmc.facebook.com/intern/diff/D64218206/)

Pull Request resolved: pytorch#137912
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request


[PyTorch] Use Half, not float16_t, in fp16 gemv fast path signatures (pytorch#137913)

bfb12c3

float16_t is ARM-specific. Half is not.

Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/)

Pull Request resolved: pytorch#137913
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911, pytorch#137912

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request


[PyTorch] Move FP16 dot and GEMV kernels to new file in ATen/native/cpu/ (pytorch#137914)

fd55e52

This is in preparation for supporting x86 as well; we need to
be in this directory so that we can get rebuilt with different
CPU_CAPABILITY settings (AVX2/AVX-512). Also incidentally starts
fulfilling request from @malfet to split the ARM64 fast path stuff
into its own file. BFloat16 will be in a later diff.

Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/)

Pull Request resolved: pytorch#137914
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911, pytorch#137912, pytorch#137913

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request


[PyTorch] Clean up Registers/ElementsPerIteration constants (pytorch#137915)

e16cfce

In preparation for other vector instruction sets. (NEON and AVX512 have 32 registers, but AVX and AVX2 have only 16.)

Differential Revision: [D64265759](https://our.internmc.facebook.com/intern/diff/D64265759/)

Pull Request resolved: pytorch#137915
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911, pytorch#137912, pytorch#137913, pytorch#137914

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request


[PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half (pytorch#137916)

The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler.

Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/)

Pull Request resolved: pytorch#137916
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911, pytorch#137912, pytorch#137913, pytorch#137914, pytorch#137915

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request


[PyTorch] Build ReducedPrecisionFloatGemvFastPathKernel & entry points for non-ARM architectures too (pytorch#137917)

Remove reasons to gate it on ARM.

Differential Revision: [D64280687](https://our.internmc.facebook.com/intern/diff/D64280687/)

Pull Request resolved: pytorch#137917
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911, pytorch#137912, pytorch#137913, pytorch#137914, pytorch#137915, pytorch#137916

github-actions bot deleted the gh/swolchok/651/head branch

November 29, 2024 02:13

Labels

ciflow/trunk fb-exported Merged module: cpu topic: not user facing