[PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half #137916


Closed

swolchok wants to merge 9 commits into gh/swolchok/664/base from gh/swolchok/664/head

Contributor

swolchok commented Oct 14, 2024

Stack from ghstack (oldest at bottom):

The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler.
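For readers without the diff at hand, here is a minimal sketch of the loop shape the description seems to refer to: a main loop that consumes one whole vector register per step, followed by a plain scalar fixup loop for the leftover elements (at most 7 when a register holds 8 lanes). This is not the code from this PR. `Vec8`, `kLanes`, and `dot_with_scalar_tail` are hypothetical names, `float` stands in for the reduced precision element type, and the register is emulated with a plain array so the example compiles anywhere.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for one whole vector register with 8 lanes.
// Assumption: the real kernel uses at::vec::Vectorized; this is an emulation.
constexpr std::size_t kLanes = 8;

struct Vec8 {
  std::array<float, kLanes> lane{};  // zero-initialized accumulator by default

  // Load kLanes consecutive elements starting at p.
  static Vec8 load(const float* p) {
    Vec8 v;
    for (std::size_t i = 0; i < kLanes; ++i) {
      v.lane[i] = p[i];
    }
    return v;
  }

  // Lane-wise multiply-add: returns a * b + acc.
  static Vec8 fmadd(const Vec8& a, const Vec8& b, const Vec8& acc) {
    Vec8 r;
    for (std::size_t i = 0; i < kLanes; ++i) {
      r.lane[i] = a.lane[i] * b.lane[i] + acc.lane[i];
    }
    return r;
  }

  // Horizontal sum of all lanes.
  float reduce_add() const {
    float s = 0.0f;
    for (float x : lane) {
      s += x;
    }
    return s;
  }
};

// Dot product: whole-register vector loop, then a scalar fixup loop.
float dot_with_scalar_tail(const float* a, const float* b, std::size_t n) {
  Vec8 acc;
  const std::size_t vec_end = n - (n % kLanes);  // last index covered by full registers
  std::size_t i = 0;
  for (; i < vec_end; i += kLanes) {
    acc = Vec8::fmadd(Vec8::load(a + i), Vec8::load(b + i), acc);
  }
  float sum = acc.reduce_add();
  // Scalar fixup: handles the remaining n % kLanes elements (0 to 7).
  for (; i < n; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

With `n = 19`, the vector loop covers indices 0 through 15 in two steps and the scalar loop handles the last three elements. Since the tail never needs a narrower vector type, the same structure should carry over to targets whose registers have a different width by changing only the lane count.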

Differential Revision: D64280689

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10


[PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half

8ee8608


pytorch-bot bot commented Oct 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137916


✅ No Failures

As of commit 13cf8aa with merge base b9618c9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the module: cpu label

Contributor

facebook-github-bot commented Oct 14, 2024

This pull request was exported from Phabricator. Differential Revision: D64280689

facebook-github-bot added the fb-exported label

This was referenced Oct 14, 2024

[PyTorch] Check defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256) instead of defined(CPU_CAPABILITY_NEON) #137722

Closed

[PyTorch] Use 128-bit vectors for ARM64 #137426

Closed

[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert #137661

Closed

[PyTorch] Specialize Vectorized<Half> for NEON even if FP16 arithmetic isn't available #137911

Closed

[PyTorch] Migrate fp16 gemv fast path kernel from intrinsics to vec::Vectorized #137912

Closed

[PyTorch] Use Half, not float16_t, in fp16 gemv fast path signatures #137913

Closed

[PyTorch] Move FP16 dot and GEMV kernels to new file in ATen/native/cpu/ #137914

Closed

[PyTorch] Clean up Registers/ElementsPerIteration constants #137915

Closed

[PyTorch] Build ReducedPrecisionFloatGemvFastPathKernel & entry points for non-ARM architectures too #137917

Closed

[PyTorch] Hook up fp16_gemv_trans to x86 fp16 GEMM #137918

Closed


Update on "[PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half"

cea0201



swolchok mentioned this pull request

[PyTorch] Hook up fp16_gemv_trans to gemv fast path for non-aarch64 architectures #138005

Closed


Update on "[PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half"

b8e9d38



swolchok mentioned this pull request

[PyTorch] Support non-zero beta in fp16_gemv_trans #138275

Closed

swolchok requested a review from malfet

October 17, 2024 22:55


Update on "[PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half"

dde479a



This was referenced Oct 22, 2024

[PyTorch] Fix inductor CPU masked() body codegen when result dtype is bool and operator is where #138486

Closed

[PyTorch] Fix inductor bug with unrolled vectorized prod #138542

Closed

swolchok added the topic: not user facing label


Update on "[PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half"

1132e4f




Update on "[PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half"

7336db9



This was referenced Oct 24, 2024

[PyTorch] Unbreak VectorizedN fmadd/fmsub/clamp #138655

Closed

[PyTorch] Fix ASAN failures for vec_test_all_types Cast test #138716

Closed

[PyTorch] Fix out-of-bounds array access in atomic_add_vec #138744

Closed


Update on "[PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half"

e75cb05




Update on "[PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half"

d29b5f3



This was referenced Oct 28, 2024

Move bf16_gemv_trans to ReducedPrecisionFloatGemvFastPathKernel #139081

Closed

[PyTorch] Add efficient isnan for NEON float #139082

Closed

[PyTorch] Add efficient isnan for NEON half #139083

Closed

Extract value_type-generic NEON Vectorized<Half> functions to CRTP base class #139084

Closed

Add Vectorized<c10::BFloat16> specialization for ARM #139090

Closed

[PyTorch] Migrate bf16 gemv fast path kernel from intrinsics to vec::Vectorized #139159

Closed

malfet approved these changes


pytorch-bot bot added the ciflow/trunk label


Update on "[PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half"

13cf8aa



This was referenced Oct 29, 2024

Build bf16 gemv fast path & entry points for non-ARM architectures too #139208

Closed

Hook up bf16_gemv_trans to x86 bf16 GEMM #139220

Closed

pytorchmergebot added the Merged label

pytorchmergebot closed this in fc2d0da

pytorchmergebot pushed a commit that referenced this pull request


[PyTorch] Build ReducedPrecisionFloatGemvFastPathKernel & entry points for non-ARM architectures too (#137917)

b29c170

Remove reasons to gate it on ARM.

Differential Revision: [D64280687](https://our.internmc.facebook.com/intern/diff/D64280687/)

Pull Request resolved: #137917
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915, #137916

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request


[PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half (pytorch#137916)

The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler.

Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/)

Pull Request resolved: pytorch#137916
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911, pytorch#137912, pytorch#137913, pytorch#137914, pytorch#137915

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request


[PyTorch] Build ReducedPrecisionFloatGemvFastPathKernel & entry points for non-ARM architectures too (pytorch#137917)

Remove reasons to gate it on ARM.

Differential Revision: [D64280687](https://our.internmc.facebook.com/intern/diff/D64280687/)

Pull Request resolved: pytorch#137917
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911, pytorch#137912, pytorch#137913, pytorch#137914, pytorch#137915, pytorch#137916

github-actions bot deleted the gh/swolchok/664/head branch

November 29, 2024 02:13


Labels

ciflow/trunk, fb-exported, Merged, module: cpu, topic: not user facing