[Compile] Add NEON implementation of fp32->bf16 conversion #137131
Conversation
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/137131
Note: links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit b6ef72d with merge base b1b6816.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
kimishpatel
left a comment
left some comments to emit fewer instructions
```cpp
VectorizedN<BFloat16, 1> result;
uint32x4x2_t u32x4x2_0 = vld1q_u32_x2(reinterpret_cast<const uint32_t*>(&src[0]));
uint32x4x2_t u32x4x2_1 = vld1q_u32_x2(reinterpret_cast<const uint32_t*>(&src[1]));
uint16x4_t u16x4_0 = vmovn_u32(vshrq_n_u32(u32x4x2_0.val[0], 16));
```
Suggested change:
```diff
- uint16x4_t u16x4_0 = vmovn_u32(vshrq_n_u32(u32x4x2_0.val[0], 16));
+ uint16x4_t u16x4_0 = vshrn_n_u32(u32x4x2_0.val[0], 16);
```
You can use a narrowing shift to avoid vmovn. ARM is extremely rich in the types of instructions it supports, but that can also be confusing, because there is often a more compact way of doing the same thing.
```cpp
uint32x4x2_t u32x4x2_0 = vld1q_u32_x2(reinterpret_cast<const uint32_t*>(&src[0]));
uint32x4x2_t u32x4x2_1 = vld1q_u32_x2(reinterpret_cast<const uint32_t*>(&src[1]));
uint16x4_t u16x4_0 = vmovn_u32(vshrq_n_u32(u32x4x2_0.val[0], 16));
uint16x4_t u16x4_1 = vmovn_u32(vshrq_n_u32(u32x4x2_0.val[1], 16));
```
Suggested change:
```diff
- uint16x4_t u16x4_1 = vmovn_u32(vshrq_n_u32(u32x4x2_0.val[1], 16));
+ uint16x8_t u16x8_low = vshrn_high_n_u32(u16x4_0, u32x4x2_0.val[1], 16);
```
And this form, which narrows directly into the upper half of the destination register, can be used to avoid vcombine.
```cpp
struct VecConvert<BFloat16, 1, float, 2> {
  static inline VectorizedN<BFloat16, 1> apply(
      const VectorizedN<float, 2>& src) {
```
Will you add "1 bf16" -> "2 fp32" too?
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Stack from ghstack (oldest at bottom):
vshlq_n_u32 instead of vshlq_u32 #137122

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10