@malfet malfet commented Mar 19, 2024

This started as a re-land of #105590 focused on enabling it on macOS, but quickly turned into landing only very limited platform-specific acceleration for now (i.e., this PR does not add any NEON-accelerated code at all; it just enables vectorized compilation for the existing abstractions).

Enabling the test harness uncovered a number of latent issues in CPU Inductor that were fixed in the following PRs:

  • #122511
  • #122513
  • #122580
  • #122608

The following was added or changed to enable vectorized code on macOS:

  • Added a VecNEON class to _inductor/codecache.py, supported on all Apple Silicon Macs
  • Added Vectorized::loadu_one_fourth to vec_base.h, limited to 8-bit types
  • Changed the 64-bit integral type mappings to int64_t/uint64_t to align with the rest of the code, since on macOS int64_t is a long long rather than a long (see [C10] Make Scalar constructable from longs #118149 for more details)
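As a rough illustration of the first two items, here is a minimal sketch of how an ISA descriptor could be selected on arm64 Macs. The names `VecIsaSketch` and `pick_vec_isa` are hypothetical and are not the code added to `_inductor/codecache.py`; the only facts assumed are that NEON registers are 128 bits wide and that Apple Silicon Macs report `Darwin`/`arm64`. The lane arithmetic also shows one plausible reason a quarter-width load pairs with 8-bit types: a register holds four times as many 8-bit elements as 32-bit ones.

```python
import platform
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VecIsaSketch:
    """Hypothetical stand-in for an ISA descriptor such as VecNEON."""

    name: str
    bit_width: int  # NEON registers are 128 bits wide

    def lanes(self, itemsize_bytes: int) -> int:
        """Number of elements of the given byte size that fit in one register."""
        return self.bit_width // (8 * itemsize_bytes)


def pick_vec_isa(
    system: Optional[str] = None, machine: Optional[str] = None
) -> Optional[VecIsaSketch]:
    """Return a NEON descriptor on arm64 Macs, otherwise None (sketch only)."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        return VecIsaSketch(name="neon", bit_width=128)
    return None


if __name__ == "__main__":
    isa = pick_vec_isa()
    if isa is None:
        print("no NEON descriptor selected on this host")
    else:
        # A quarter of the 8-bit lanes equals the number of 32-bit lanes.
        print(isa.name, isa.lanes(1), isa.lanes(1) // 4, isa.lanes(4))
```

Passing `system` and `machine` explicitly makes the check easy to exercise on any host, which is why the sketch takes them as optional arguments instead of reading the platform unconditionally.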

See table below for perf changes with and without torch.compile using gpt-fast running stories15M on M2 Pro:

| dtype | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16 | 120 tokens/sec | 130 tokens/sec | 156 tokens/sec |
| float32 | 158 tokens/sec | 140 tokens/sec | 236 tokens/sec |
| float16 | 235 tokens/sec | 81 tokens/sec | 58 tokens/sec |
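To make the table easier to read, the relative changes can be computed directly from the reported numbers. The snippet below is only arithmetic on the figures above (the dictionary keys and the `summarize` helper are illustrative names, not part of the PR); it suggests that compiled bfloat16 and float32 get faster with this change, while compiled float16 gets slower and stays well below eager.

```python
# Throughput in tokens/sec, copied from the table above.
RESULTS = {
    "bfloat16": {"eager": 120, "before": 130, "after": 156},
    "float32": {"eager": 158, "before": 140, "after": 236},
    "float16": {"eager": 235, "before": 81, "after": 58},
}


def summarize(results):
    """Return per-dtype throughput ratios, rounded to two decimals."""
    summary = {}
    for dtype, r in results.items():
        summary[dtype] = {
            # Effect of this PR on compiled code.
            "after_vs_before": round(r["after"] / r["before"], 2),
            # Compiled code (with this PR) relative to eager mode.
            "after_vs_eager": round(r["after"] / r["eager"], 2),
        }
    return summary


if __name__ == "__main__":
    for dtype, ratios in summarize(RESULTS).items():
        print(dtype, ratios)
```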

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

pytorch-bot bot commented Mar 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122217

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c6f3d57 with merge base eda279c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor module: cpu CPU specific problem (e.g., perf, algorithm) module: inductor labels Mar 19, 2024
@malfet malfet requested review from jansel and mikekgfb March 19, 2024 21:08
@malfet malfet added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 19, 2024
@malfet malfet force-pushed the malfet/re-enable-vectorization-on-apple-silicon branch 5 times, most recently from 93e51a7 to 14a0f43 Compare March 25, 2024 03:47
This is a re-land of #105590, but
this time enabling it only on the Darwin platform, where those instructions
are available by default
@malfet malfet changed the title from "[Reland][Inductor] Add support for NEON ISA For Mac OS" to "[Inductor] Add NEON ISA support on MacOS" Mar 25, 2024
@malfet malfet changed the title from "[Inductor] Add NEON ISA support on MacOS" to "[Inductor] Add NEON ISA support on arm64 Macs" Mar 25, 2024
@malfet malfet force-pushed the malfet/re-enable-vectorization-on-apple-silicon branch from 14a0f43 to c6f3d57 Compare March 25, 2024 22:57
malfet commented Mar 26, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
This started as a re-land of #105590 focused on enabling it on macOS, but quickly turned into landing only very limited platform-specific acceleration for now (i.e., this PR does not add any NEON-accelerated code at all; it just enables vectorized compilation for the existing abstractions).

Enabling the test harness uncovered a number of latent issues in CPU Inductor that were fixed in the following PRs:
- #122511
- #122513
- #122580
- #122608

The following was added or changed to enable vectorized code on macOS:
 - Added a `VecNEON` class to `_inductor/codecache.py`, supported on all Apple Silicon Macs
 - Added `Vectorized::loadu_one_fourth` to `vec_base.h`, limited to 8-bit types
 - Changed the 64-bit integral type mappings to `int64_t`/`uint64_t` to align with the rest of the code, since on macOS `int64_t` is a `long long` rather than a `long` (see #118149 for more details)

See table below for perf changes with and without torch.compile using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on M2 Pro:
| dtype  | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16  | 120 tokens/sec  | 130 tokens/sec | 156 tokens/sec |
| float32  | 158 tokens/sec  | 140 tokens/sec | 236 tokens/sec |
| float16  | 235 tokens/sec  | 81 tokens/sec | 58 tokens/sec |

Pull Request resolved: #122217
Approved by: https://github.com/jansel
@github-actions github-actions bot deleted the malfet/re-enable-vectorization-on-apple-silicon branch April 26, 2024 01:52
Labels

  • ciflow/inductor
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • Merged
  • module: cpu (CPU specific problem, e.g., perf, algorithm)
  • module: inductor
  • release notes: inductor
  • topic: improvements (topic category)
