@malfet malfet commented Oct 1, 2024

[ghstack-poisoned]

pytorch-bot bot commented Oct 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137131

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b6ef72d with merge base b1b6816:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu label Oct 1, 2024
malfet added a commit that referenced this pull request Oct 1, 2024
@malfet malfet requested review from jgong5 and kimishpatel October 2, 2024 14:22

@kimishpatel kimishpatel left a comment

Left some comments on how to emit fewer instructions.

VectorizedN<BFloat16, 1> result;
uint32x4x2_t u32x4x2_0 = vld1q_u32_x2(reinterpret_cast<const uint32_t*>(&src[0]));
uint32x4x2_t u32x4x2_1 = vld1q_u32_x2(reinterpret_cast<const uint32_t*>(&src[1]));
uint16x4_t u16x4_0 = vmovn_u32(vshrq_n_u32(u32x4x2_0.val[0], 16));

Suggested change
- uint16x4_t u16x4_0 = vmovn_u32(vshrq_n_u32(u32x4x2_0.val[0], 16));
+ uint16x4_t u16x4_0 = vshrn_n_u32(u32x4x2_0.val[0], 16);

You can use a narrowing shift (vshrn_n_u32) to avoid the separate vmovn. ARM is extremely rich in the kinds of instructions it supports, which also makes it confusing: there is often a more compact way of doing the same thing.
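To illustrate the equivalence, here is a minimal sketch, not the PR's code. It assumes plain `float` input and raw `uint16_t` bit patterns standing in for BFloat16. The helper names (`truncate_scalar`, `truncate_two_step`, `truncate_narrowing_shift`, `check_variants`) are hypothetical. The NEON variants are only compiled on AArch64; elsewhere only the scalar reference is built. Both forms truncate (keep the upper 16 bits) and do not round.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

using U16x4 = std::array<uint16_t, 4>;

// Scalar reference: keep the upper 16 bits of each fp32 value
// (plain truncation, no rounding).
U16x4 truncate_scalar(const float* src) {
  U16x4 out{};
  for (int i = 0; i < 4; ++i) {
    uint32_t bits;
    std::memcpy(&bits, &src[i], sizeof(bits));
    out[i] = static_cast<uint16_t>(bits >> 16);
  }
  return out;
}

#if defined(__aarch64__)
// Original form: shift right, then narrow (two intrinsics).
U16x4 truncate_two_step(const float* src) {
  uint32x4_t bits = vreinterpretq_u32_f32(vld1q_f32(src));
  U16x4 out{};
  vst1_u16(out.data(), vmovn_u32(vshrq_n_u32(bits, 16)));
  return out;
}

// Suggested form: a single narrowing shift.
U16x4 truncate_narrowing_shift(const float* src) {
  uint32x4_t bits = vreinterpretq_u32_f32(vld1q_f32(src));
  U16x4 out{};
  vst1_u16(out.data(), vshrn_n_u32(bits, 16));
  return out;
}
#endif

// Asserts that every variant compiled for this target matches the reference.
void check_variants(const float* src) {
  const U16x4 ref = truncate_scalar(src);
#if defined(__aarch64__)
  assert(truncate_two_step(src) == ref);
  assert(truncate_narrowing_shift(src) == ref);
#else
  (void)ref;  // NEON variants are not built on this target
#endif
}
```

Whether the compiler already folds the two-step form into one instruction depends on the toolchain, so checking the generated assembly is the only reliable way to confirm the saving.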

uint32x4x2_t u32x4x2_0 = vld1q_u32_x2(reinterpret_cast<const uint32_t*>(&src[0]));
uint32x4x2_t u32x4x2_1 = vld1q_u32_x2(reinterpret_cast<const uint32_t*>(&src[1]));
uint16x4_t u16x4_0 = vmovn_u32(vshrq_n_u32(u32x4x2_0.val[0], 16));
uint16x4_t u16x4_1 = vmovn_u32(vshrq_n_u32(u32x4x2_0.val[1], 16));

Suggested change
- uint16x4_t u16x4_1 = vmovn_u32(vshrq_n_u32(u32x4x2_0.val[1], 16));
+ uint16x8_t u16x8_low = vshrn_high_n_u32(u16x4_0, u32x4x2_0.val[1], 16);

And vshrn_high_n_u32 can be used to avoid the vcombine: it narrows the second vector directly into the upper half of the result.
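A minimal sketch of the two ways to build the 8-lane result, again assuming raw `uint16_t` bit patterns in place of BFloat16 and using hypothetical helper names (`truncate8_scalar`, `truncate8_combine`, `truncate8_high`, `check_variants8`). It is not the PR's code. The NEON variants are guarded so the block still compiles on non-AArch64 targets.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

using U16x8 = std::array<uint16_t, 8>;

// Scalar reference: upper 16 bits of each of the 8 fp32 inputs.
U16x8 truncate8_scalar(const float* src) {
  U16x8 out{};
  for (int i = 0; i < 8; ++i) {
    uint32_t bits;
    std::memcpy(&bits, &src[i], sizeof(bits));
    out[i] = static_cast<uint16_t>(bits >> 16);
  }
  return out;
}

#if defined(__aarch64__)
// Narrow each half separately, then join the halves with vcombine_u16.
U16x8 truncate8_combine(const float* src) {
  uint32x4_t lo = vreinterpretq_u32_f32(vld1q_f32(src));
  uint32x4_t hi = vreinterpretq_u32_f32(vld1q_f32(src + 4));
  U16x8 out{};
  vst1q_u16(out.data(),
            vcombine_u16(vshrn_n_u32(lo, 16), vshrn_n_u32(hi, 16)));
  return out;
}

// Narrow the first half, then let vshrn_high_n_u32 write the second half
// into the upper four lanes of the same 128-bit result.
U16x8 truncate8_high(const float* src) {
  uint32x4_t lo = vreinterpretq_u32_f32(vld1q_f32(src));
  uint32x4_t hi = vreinterpretq_u32_f32(vld1q_f32(src + 4));
  uint16x4_t low_half = vshrn_n_u32(lo, 16);
  U16x8 out{};
  vst1q_u16(out.data(), vshrn_high_n_u32(low_half, hi, 16));
  return out;
}
#endif

// Asserts that every variant compiled for this target matches the reference.
void check_variants8(const float* src) {
  const U16x8 ref = truncate8_scalar(src);
#if defined(__aarch64__)
  assert(truncate8_combine(src) == ref);
  assert(truncate8_high(src) == ref);
#else
  (void)ref;  // NEON variants are not built on this target
#endif
}
```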

Comment on lines +319 to +321
struct VecConvert<BFloat16, 1, float, 2> {
static inline VectorizedN<BFloat16, 1> apply(
const VectorizedN<float, 2>& src) {
Collaborator

Will you add the reverse conversion ("1 bf16" -> "2 fp32") too?
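For reference, the reverse direction could look roughly like the sketch below. This is an assumption about what such a conversion might do, not code from this PR: it widens each 16-bit pattern to 32 bits and shifts it into the upper half of an fp32. Raw `uint16_t` values stand in for BFloat16, and the names `widen8_scalar`, `widen8_neon`, and `check_widen8` are hypothetical. ACLE also defines widening shifts such as `vshll_n_u16`, which may fold the widen and shift into one step; the sketch keeps them separate for clarity.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

using F32x8 = std::array<float, 8>;

// Scalar reference: place each bf16 bit pattern in the upper 16 bits
// of an fp32 value; the lower 16 bits are zero.
F32x8 widen8_scalar(const uint16_t* src) {
  F32x8 out{};
  for (int i = 0; i < 8; ++i) {
    uint32_t bits = static_cast<uint32_t>(src[i]) << 16;
    std::memcpy(&out[i], &bits, sizeof(bits));
  }
  return out;
}

#if defined(__aarch64__)
// NEON variant: zero-extend each half to 32 bits, then shift left by 16.
F32x8 widen8_neon(const uint16_t* src) {
  uint16x8_t v = vld1q_u16(src);
  uint32x4_t lo = vshlq_n_u32(vmovl_u16(vget_low_u16(v)), 16);
  uint32x4_t hi = vshlq_n_u32(vmovl_high_u16(v), 16);
  F32x8 out{};
  vst1q_f32(out.data(), vreinterpretq_f32_u32(lo));
  vst1q_f32(out.data() + 4, vreinterpretq_f32_u32(hi));
  return out;
}
#endif

// Asserts that the NEON variant (when built) matches the scalar reference.
// Intended for finite inputs only, since NaN never compares equal.
void check_widen8(const uint16_t* src) {
  const F32x8 ref = widen8_scalar(src);
#if defined(__aarch64__)
  assert(widen8_neon(src) == ref);
#else
  (void)ref;  // NEON variant is not built on this target
#endif
}
```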

github-actions bot commented Dec 7, 2024

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Dec 7, 2024
@github-actions github-actions bot closed this Jan 6, 2025
@github-actions github-actions bot deleted the gh/malfet/31/head branch February 9, 2025 02:08
Labels

ciflow/linux-aarch64, module: cpu, release notes: inductor, Stale, topic: performance

4 participants