[Inductor] Add NEON ISA support on arm64 Macs #122217
Conversation
✅ No failures as of commit c6f3d57 with merge base eda279c (Dr. CI, updated every 15 minutes). Artifacts and rendered test results: hud.pytorch.org/pr/122217
This is a re-land of #105590, but this time enabling it only for the Darwin platform, where these instructions are available by default.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This started as a re-land of #105590 focused on enabling it on macOS, but it quickly turned into landing only very limited platform-specific acceleration for now (i.e. this PR does not add any NEON-accelerated code at all; it just enables vectorized compilation for the existing abstractions).

Enabling the test harness uncovered a number of latent issues in the CPU inductor that were fixed in the following PRs:
- #122511
- #122513
- #122580
- #122608

The following was added/changed to enable the vectorization code to work on macOS:
- Added a `VecNEON` class to `_inductor/codecache.py` that is supported on all Apple Silicon Macs
- Added `Vectorized::loadu_one_fourth` to `vec_base.h`, limited to 8-bit types
- Changed the 64-bit integral type mapping to `int64_t`/`uint64_t` to align with the rest of the code, since on macOS `int64_t` is a `long long` rather than a `long` (see #118149 for more details)

See the table below for perf changes with and without torch.compile, using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on an M2 Pro:

| dtype | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16 | 120 tokens/sec | 130 tokens/sec | 156 tokens/sec |
| float32 | 158 tokens/sec | 140 tokens/sec | 236 tokens/sec |
| float16 | 235 tokens/sec | 81 tokens/sec | 58 tokens/sec |

Pull Request resolved: #122217
Approved by: https://github.com/jansel