[Compile] Add NEON implementation of fp32->bf16 conversion #137131
Conversation
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/137131
Note: links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit b6ef72d with merge base b1b6816.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
kimishpatel
left a comment
left some comments to emit fewer instructions
```cpp
VectorizedN<BFloat16, 1> result;
uint32x4x2_t u32x4x2_0 = vld1q_u32_x2(reinterpret_cast<const uint32_t*>(&src[0]));
uint32x4x2_t u32x4x2_1 = vld1q_u32_x2(reinterpret_cast<const uint32_t*>(&src[1]));
uint16x4_t u16x4_0 = vmovn_u32(vshrq_n_u32(u32x4x2_0.val[0], 16));
```
Suggested change:
```diff
- uint16x4_t u16x4_0 = vmovn_u32(vshrq_n_u32(u32x4x2_0.val[0], 16));
+ uint16x4_t u16x4_0 = vshrn_n_u32(u32x4x2_0.val[0], 16);
```
You can use a narrowing shift to avoid vmovn. ARM is extremely rich in the types of instructions it supports, but that can also be confusing, because there is often a more compact way of doing the same thing.
```cpp
uint32x4x2_t u32x4x2_0 = vld1q_u32_x2(reinterpret_cast<const uint32_t*>(&src[0]));
uint32x4x2_t u32x4x2_1 = vld1q_u32_x2(reinterpret_cast<const uint32_t*>(&src[1]));
uint16x4_t u16x4_0 = vmovn_u32(vshrq_n_u32(u32x4x2_0.val[0], 16));
uint16x4_t u16x4_1 = vmovn_u32(vshrq_n_u32(u32x4x2_0.val[1], 16));
```
Suggested change:
```diff
- uint16x4_t u16x4_1 = vmovn_u32(vshrq_n_u32(u32x4x2_0.val[1], 16));
+ uint16x8_t u16x8_low = vshrn_high_n_u32(u16x4_0, u32x4x2_0.val[1], 16);
```
And this form, which narrows directly into the upper half of the destination register, can be used to avoid vcombine.
```cpp
struct VecConvert<BFloat16, 1, float, 2> {
  static inline VectorizedN<BFloat16, 1> apply(
      const VectorizedN<float, 2>& src) {
```
Will you add "1 bf16" -> "2 fp32" too?
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Stack from ghstack (oldest at bottom):
vshlq_n_u32 instead of vshlq_u32 #137122

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10