Unbreak vec128_half_neon comparison without FP16 hardware support #139558

Closed

swolchok wants to merge 7 commits into gh/swolchok/688/base from gh/swolchok/688/head

Contributor

swolchok commented Nov 2, 2024

Stack from ghstack (oldest at bottom):

Discovered this bug when working on Vectorized<BFloat16>; apparently we have no automated testing for aarch64 without FP16.

Testing: Manually disabled the FP16 feature for a local vec_test_all_types run on Mac; it passes.

Differential Revision: D65385267

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10


[PyTorch] Unbreak vec128_half_neon comparison ops without FP16 hardware suppo

d3f8125

Discovered this bug when working on Vectorized<BFloat16>; apparently we have no automated testing for aarch64 without FP16.

Testing: Manually disable FP16 feature for local vec_test_all_types run on Mac; see pass.

Differential Revision: [D65385267](https://our.internmc.facebook.com/intern/diff/D65385267/)

pytorch-bot bot added the module: cpu label

pytorch-bot bot commented Nov 2, 2024

✅ No Failures

As of commit bcb32c3 with merge base 419a7e1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Contributor

facebook-github-bot commented Nov 2, 2024

This pull request was exported from Phabricator. Differential Revision: D65385267

facebook-github-bot added the fb-exported label

This was referenced Nov 2, 2024

Extract value_type-generic NEON Vectorized<Half> functions to CRTP base class #139084

Closed

Add Vectorized<c10::BFloat16> specialization for ARM #139090

Closed

Move bf16_gemv_trans to ReducedPrecisionFloatGemvFastPathKernel #139081

Closed

Build bf16 gemv fast path & entry points for non-ARM architectures too #139208

Closed

Hook up bf16_gemv_trans to x86 bf16 GEMM #139220

Closed

Contributor Author

swolchok commented Nov 2, 2024

> Testing: Manually disable FP16 feature for local vec_test_all_types run on Mac; see pass.

This is only a problem if FCVTN canonicalizes NaNs. Whether it does so by default is implementation-defined: the behavior is controlled by the DN bit of the FPCR, whose value is unknown after a reset (https://developer.arm.com/documentation/ddi0595/2021-03/AArch64-Registers/FPCR--Floating-point-Control-Register?lang=en#fieldset_0-25_25). I therefore cannot reproduce the bug locally with the existing behavior, but I can check that things still work with this diff.
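To make the failure mode concrete, here is a minimal portable sketch, not the code from this diff. It assumes a "true" comparison lane is all-ones bits, models only the NaN case of a float-to-half conversion, and uses hypothetical helper names (`narrow_mask_bits`, `convert_nan_to_half`). When the conversion substitutes the default NaN, an all-ones mask comes out as `0x7E00`; narrowing the raw bits keeps `0xFFFF`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: keep the low 16 bits of a 32-bit mask lane,
// the way an integer narrowing would (no floating-point semantics).
constexpr std::uint16_t narrow_mask_bits(std::uint32_t lane) {
  return static_cast<std::uint16_t>(lane & 0xFFFFu);
}

// Hypothetical model of float -> half conversion for NaN inputs ONLY.
// Assumption: with default-NaN behavior the result is the default half NaN
// (0x7E00); otherwise the sign and the top payload bits are kept and the
// NaN is quieted.
constexpr std::uint16_t convert_nan_to_half(std::uint32_t nan_bits,
                                            bool default_nan) {
  if (default_nan) {
    return 0x7E00;
  }
  const std::uint32_t sign = (nan_bits >> 16) & 0x8000u;     // bit 31 -> bit 15
  const std::uint32_t payload = (nan_bits >> 13) & 0x01FFu;  // bits 21..13
  return static_cast<std::uint16_t>(sign | 0x7C00u | 0x0200u | payload);
}

// An all-ones float lane is a NaN bit pattern, so conversion may mangle it.
static_assert(narrow_mask_bits(0xFFFFFFFFu) == 0xFFFF,
              "bitwise narrowing preserves the mask");
static_assert(convert_nan_to_half(0xFFFFFFFFu, true) == 0x7E00,
              "default-NaN conversion destroys the mask");
static_assert(convert_nan_to_half(0xFFFFFFFFu, false) == 0xFFFF,
              "without default-NaN the mask happens to survive");
```

In this model the mask survives the conversion when default-NaN behavior is off, which is consistent with the bug not reproducing on a machine that leaves DN clear.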

swolchok requested a review from malfet

November 2, 2024 19:13

swolchok added release notes: jit topic: bug fixes labels

swolchok changed the title ~~[PyTorch] Unbreak vec128_half_neon comparison ops without FP16 hardware suppo~~ Unbreak vec128_half_neon comparison without FP16 hardware support

pytorch-bot bot added the ciflow/trunk label


Update on "Unbreak vec128_half_neon comparison without FP16 hardware support"

a7004f2


Update on "Unbreak vec128_half_neon comparison without FP16 hardware support"

f2b4649


Update on "Unbreak vec128_half_neon comparison without FP16 hardware support"

5309fdf


Update on "Unbreak vec128_half_neon comparison without FP16 hardware support"

86e32a3


Update on "Unbreak vec128_half_neon comparison without FP16 hardware support"

1cd6313


Update on "Unbreak vec128_half_neon comparison without FP16 hardware support"

bcb32c3

malfet approved these changes

aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h

    
    return Vectorized<c10::Half>(vcombine_f16(r00, r01));
  }
  Vectorized<c10::Half> map2_bitmask_with_vec_float_method(

Contributor

malfet Nov 8, 2024

Yuck

pytorchmergebot closed this in 44f6d14

pytorchmergebot pushed a commit that referenced this pull request


Move bf16_gemv_trans to ReducedPrecisionFloatGemvFastPathKernel (#139081)

7f0bf9f

Following the previous move of fp16_gemv_trans.

Testing: Checked for performance regression with llm_benchmarks' `python benchmarks/benchmark_torch_mm.py llm`, didn't find one
Differential Revision: [D64930872](https://our.internmc.facebook.com/intern/diff/D64930872/)

Pull Request resolved: #139081
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090, #139558

pytorchmergebot added the Merged label

pytorchmergebot pushed a commit that referenced this pull request


Build bf16 gemv fast path & entry points for non-ARM architectures too (#139208)

25c469b

Very similar to #137917, but for bf16.

Differential Revision: [D65155971](https://our.internmc.facebook.com/intern/diff/D65155971/)

Pull Request resolved: #139208
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090, #139558, #139081

pytorchmergebot pushed a commit that referenced this pull request


          Hook up bf16_gemv_trans to x86 bf16 GEMM (#139220)

cc44b55

This is the big milestone for bf16 and should enable us to close pytorch/torchchat#1253.

Testing: ran `python torchchat.py generate llama3.2-1b --dtype bf16 --device cpu` on an x86 machine with AVX512-bf16 and observed similar tokens/sec with and without the MKL path hand-disabled. Also observed a speedup from ~2.1 tok/sec to 7.4 tok/sec on an x86 machine with only AVX2.

Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/)
Pull Request resolved: #139220
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090, #139558, #139081, #139208

atalman pushed a commit to atalman/pytorch that referenced this pull request


Unbreak vec128_half_neon comparison without FP16 hardware support (pytorch#139558)

1d036bb

Discovered this bug when working on Vectorized<BFloat16>; apparently we have no automated testing for aarch64 without FP16.

Testing: Manually disable FP16 feature for local vec_test_all_types run on Mac; see pass.

Differential Revision: [D65385267](https://our.internmc.facebook.com/intern/diff/D65385267/)

Pull Request resolved: pytorch#139558
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#139084, pytorch#139090

github-actions bot deleted the gh/swolchok/688/head branch

December 9, 2024 02:14

Labels

ciflow/trunk fb-exported Merged module: cpu release notes: jit topic: bug fixes