
Conversation

@swolchok (Contributor) commented Oct 14, 2024

Stack from ghstack (oldest at bottom):

We can do most of what this header does (by line count) anyway by converting to and from float.
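
As a rough illustration of the convert-to-float approach (a hypothetical helper, not code from this PR): AArch64 always provides fp16 <-> fp32 conversion intrinsics, while fp16 arithmetic intrinsics such as `vaddq_f16` are the optional part, so a Half op can be emulated roughly like this:

```cpp
#include <arm_neon.h>

// Hypothetical helper, sketch only: add two 8-lane fp16 vectors on a machine
// without native fp16 arithmetic by widening to fp32, adding, and narrowing
// back. Only fp16 <-> fp32 conversion intrinsics are needed.
static inline float16x8_t add_f16_via_f32(float16x8_t a, float16x8_t b) {
  float32x4_t a_lo = vcvt_f32_f16(vget_low_f16(a));
  float32x4_t a_hi = vcvt_f32_f16(vget_high_f16(a));
  float32x4_t b_lo = vcvt_f32_f16(vget_low_f16(b));
  float32x4_t b_hi = vcvt_f32_f16(vget_high_f16(b));
  float32x4_t r_lo = vaddq_f32(a_lo, b_lo);  // arithmetic happens in fp32
  float32x4_t r_hi = vaddq_f32(a_hi, b_hi);
  return vcombine_f16(vcvt_f16_f32(r_lo), vcvt_f16_f32(r_hi));
}
```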

Differential Revision: D64265757

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

…c isn't available

We can do most of what this header does (by line count) anyway by converting to and from float.

Differential Revision: [D64265757](https://our.internmc.facebook.com/intern/diff/D64265757/)

[ghstack-poisoned]
@pytorch-bot (bot) commented Oct 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137911

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6c24f9c with merge base b9618c9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the module: cpu (CPU specific problem, e.g., perf, algorithm) label Oct 14, 2024
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D64265757

…6 arithmetic isn't available"

We can do most of what this header does (by line count) anyway by converting to and from float.

Differential Revision: [D64265757](https://our.internmc.facebook.com/intern/diff/D64265757/)

[ghstack-poisoned]
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D64265757

…6 arithmetic isn't available"

We can do most of what this header does (by line count) anyway by converting to and from float.

Differential Revision: [D64265757](https://our.internmc.facebook.com/intern/diff/D64265757/)

[ghstack-poisoned]
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D64265757

@swolchok (Contributor, Author)

(it's hard to succinctly explain what the benefits of this one are, so I gave it a not-user-facing for release notes)

@swolchok requested review from jgong5 and malfet and removed request for malfet October 17, 2024 22:52
…6 arithmetic isn't available"

We can do most of what this header does (by line count) anyway by converting to and from float.

Differential Revision: [D64265757](https://our.internmc.facebook.com/intern/diff/D64265757/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D64265757

@swolchok (Contributor, Author)

'test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_multi_threading_dynamic_shapes_cpu'

This failure doesn't seem to repro locally on a Linux machine, nor does it pass the sniff test: this diff only affects ARM, and the failure is on Windows...

…6 arithmetic isn't available"

We can do most of what this header does (by line count) anyway by converting to and from float.

Differential Revision: [D64265757](https://our.internmc.facebook.com/intern/diff/D64265757/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D64265757

pytorchmergebot pushed a commit that referenced this pull request Oct 29, 2024
…Vectorized (#137912)

Migrated as much as possible and convenient; focusing on fp16
for now. (This is building toward enabling these fast paths on x86 for
machines without AVX-512fp16/bf16 to fix
pytorch/torchchat#1253 .)

Differential Revision: [D64218206](https://our.internmc.facebook.com/intern/diff/D64218206/)

Pull Request resolved: #137912
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911
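
For context on what "migrating to Vectorized" buys (a hypothetical sketch, not the kernel from this stack): code written against `at::vec::Vectorized<float>` compiles to NEON, AVX2, or AVX-512 depending on the CPU_CAPABILITY a translation unit is built for, so the same source can serve both ARM and x86.

```cpp
#include <ATen/cpu/vec/vec.h>
#include <cstdint>

// Hypothetical example, sketch only: a dot product written with the portable
// Vectorized<float> wrapper instead of raw NEON/AVX intrinsics.
float dot_fp32(const float* a, const float* b, int64_t n) {
  using Vec = at::vec::Vectorized<float>;
  Vec acc(0.f);
  int64_t i = 0;
  for (; i + Vec::size() <= n; i += Vec::size()) {
    acc = acc + Vec::loadu(a + i) * Vec::loadu(b + i);
  }
  // Horizontal sum of the vector accumulator via a small stack buffer,
  // then a scalar loop for the remaining elements.
  alignas(64) float buf[Vec::size()];
  acc.store(buf);
  float result = 0.f;
  for (int64_t k = 0; k < Vec::size(); ++k) {
    result += buf[k];
  }
  for (; i < n; ++i) {
    result += a[i] * b[i];
  }
  return result;
}
```
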
pytorchmergebot pushed a commit that referenced this pull request Oct 29, 2024
…pu/ (#137914)

This is in preparation for supporting x86 as well; we need to
be in this directory so that we can get rebuilt with different
CPU_CAPABILITY settings (AVX2/AVX-512). Also incidentally starts
fulfilling a request from @malfet to split the ARM64 fast path stuff
into its own file. BFloat16 will be in a later diff.

Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/)

Pull Request resolved: #137914
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913
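
Background on why the directory matters (a sketch with made-up stub and kernel names; only the DispatchStub macros and the CPU_CAPABILITY namespace convention are from ATen): files under `aten/src/ATen/native/cpu/` are compiled once per CPU_CAPABILITY and registered into a dispatch stub that picks the best variant at runtime, roughly like this.

```cpp
// Sketch only; names below are hypothetical. In real code the DECLARE/DEFINE
// half lives outside native/cpu/, and the REGISTER half lives in a
// native/cpu/*.cpp file so it is rebuilt once per CPU_CAPABILITY flavor.
#include <ATen/native/DispatchStub.h>
#include <cstdint>

namespace at::native {

using scale_fn = void (*)(float*, int64_t, float);
DECLARE_DISPATCH(scale_fn, my_scale_stub);
DEFINE_DISPATCH(my_scale_stub);

// CPU_CAPABILITY expands to DEFAULT, AVX2, AVX512, ... depending on the build.
inline namespace CPU_CAPABILITY {
void my_scale_kernel(float* data, int64_t n, float c) {
  for (int64_t i = 0; i < n; ++i) {
    data[i] *= c;
  }
}
} // namespace CPU_CAPABILITY

REGISTER_DISPATCH(my_scale_stub, &CPU_CAPABILITY::my_scale_kernel);

} // namespace at::native
```

A caller then invokes the stub (roughly `my_scale_stub(kCPU, ptr, n, c)`) and it routes to whichever registered kernel matches the host CPU.
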
pytorchmergebot pushed a commit that referenced this pull request Oct 29, 2024
…whole vector register instead of half (#137916)

The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler.

Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/)

Pull Request resolved: #137916
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915
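
A hypothetical sketch of the loop structure being described (not the PR's kernel): the main loop consumes full 8-lane fp16 registers and a plain scalar loop absorbs the up-to-7 leftover elements, instead of a half-register vector tail.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Hypothetical scale kernel, sketch only: full float16x8_t registers in the
// main loop, scalar fixup for the final n % 8 elements.
void scale_fp16(const float16_t* in, float16_t* out, float c, int64_t n) {
  int64_t i = 0;
  for (; i + 8 <= n; i += 8) {
    float16x8_t v = vld1q_f16(in + i);
    // widen to fp32, scale, narrow back (no fp16 arithmetic needed)
    float32x4_t lo = vmulq_n_f32(vcvt_f32_f16(vget_low_f16(v)), c);
    float32x4_t hi = vmulq_n_f32(vcvt_f32_f16(vget_high_f16(v)), c);
    vst1q_f16(out + i, vcombine_f16(vcvt_f16_f32(lo), vcvt_f16_f32(hi)));
  }
  for (; i < n; ++i) {  // at most 7 iterations; no half-width vector tail
    out[i] = static_cast<float16_t>(static_cast<float>(in[i]) * c);
  }
}
```
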
pytorchmergebot pushed a commit that referenced this pull request Oct 30, 2024
`mask` is already defined as `uint16x8_t`, so there is no need to reinterpret it:
https://github.com/pytorch/pytorch/blob/bd369bb18258fc3be5ee91f8fcaf06a4b6fc41a7/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h#L220

Fixes
```
var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h: In static member function 'static at::vec::DEFAULT::Vectorized<c10::Half> at::vec::DEFAULT::Vectorized<c10::Half>::set(const at::vec::DEFAULT::Vectorized<c10::Half>&, const at::vec::DEFAULT::Vectorized<c10::Half>&, int64_t)':
/var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h:227:39: error: cannot convert 'uint16x8_t' to 'float16x8_t'
  227 |                 vreinterpretq_u16_f16(mask),
      |                                       ^~~~
      |                                       |
      |                                       uint16x8_t
In file included from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/intrinsics.h:23,
                 from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128.h:4,
                 from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec.h:6,
                 from /var/lib/jenkins/workspace/aten/src/ATen/test/vec_test_all_types.h:2,
                 from /var/lib/jenkins/workspace/aten/src/ATen/test/vec_test_all_types.cpp:1:
/usr/lib/gcc/aarch64-linux-gnu/11/include/arm_neon.h:5841:36: note:   initializing argument 1 of 'uint16x8_t vreinterpretq_u16_f16(float16x8_t)'
 5841 | vreinterpretq_u16_f16 (float16x8_t __a)
      |                        ~~~~~~~~~~~~^~~
```

introduced by #137911

Also, guard any use of NEON intrinsics in `ReducedPrecisionFloatGemvFastPathKernel.cpp` with `!defined(CPU_CAPABILITY_SVE)`; otherwise compilation fails with:
```
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp: In function 'float at::native::SVE256::reduce(at::vec::SVE256::VectorizedN<c10::Half, 16>&)':
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:77:24: error: cannot convert 'at::vec::SVE256::Vectorized<float>' to 'float32x4_t'
   77 |   return vaddvq_f32(t0 + t1);
      |                     ~~~^~~~
      |                        |
      |                        at::vec::SVE256::Vectorized<float>
In file included from /var/lib/jenkins/workspace/c10/util/Half.h:51,
                 from /var/lib/jenkins/workspace/c10/util/Float8_e5m2.h:17,
                 from /var/lib/jenkins/workspace/c10/core/ScalarType.h:8,
                 from /var/lib/jenkins/workspace/c10/core/TensorImpl.h:11,
                 from /var/lib/jenkins/workspace/c10/core/GeneratorImpl.h:8,
                 from /var/lib/jenkins/workspace/aten/src/ATen/core/Generator.h:18,
                 from /var/lib/jenkins/workspace/aten/src/ATen/CPUGeneratorImpl.h:3,
                 from /var/lib/jenkins/workspace/aten/src/ATen/Context.h:4,
                 from /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:2,
                 from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.SVE256.cpp:1:
/usr/lib/gcc/aarch64-linux-gnu/11/include/arm_neon.h:10423:25: note:   initializing argument 1 of 'float32_t vaddvq_f32(float32x4_t)'
10423 | vaddvq_f32 (float32x4_t __a)
      |             ~~~~~~~~~~~~^~~
In file included from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.SVE256.cpp:1:
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp: In function 'float at::native::SVE256::reduce(at::vec::SVE256::Vectorized<float>)':
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:119:21: error: cannot convert 'at::vec::SVE256::Vectorized<float>' to 'float32x4_t'
  119 |   return vaddvq_f32(x);
      |                     ^
      |                     |
      |                     at::vec::SVE256::Vectorized<float>
```

Pull Request resolved: #139235
Approved by: https://github.com/huydhn
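
A minimal sketch of the guard (helper name hypothetical; `CPU_CAPABILITY_SVE` and `vaddvq_f32` are as quoted in the errors above): in SVE builds `Vectorized<float>` no longer converts to a raw `float32x4_t`, so NEON-only helpers have to be compiled out.

```cpp
#include <arm_neon.h>

#if !defined(CPU_CAPABILITY_SVE)
// Hypothetical helper, compiled only for NEON capability flavors of this
// translation unit; in SVE builds it would not build, so it is guarded out.
inline float horizontal_sum(float32x4_t v) {
  return vaddvq_f32(v);  // horizontal add across the 4 fp32 lanes
}
#endif
```
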
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
@github-actions bot deleted the gh/swolchok/659/head branch November 29, 2024 02:13

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
fb-exported
Merged
module: cpu (CPU specific problem, e.g., perf, algorithm)
topic: not user facing (topic category)
