[CPUInductor] Fix out-of-bounds read/write in cvt_int64_to_[fp32|int32] #122511
Conversation
Discovered while debugging regressions while enabling vectorization on the ARM platform. Without this change, `test_div2_cpu` fails with invalid values on non-x86 CPUs.
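A minimal, self-contained sketch of the bug class, assuming illustrative lane counts (the real helpers operate on `at::vec` vector types and these names and signatures are stand-ins, not the actual PyTorch code): a 256-bit vector holds 4 `int64_t` lanes but 8 `float` lanes, so a conversion loop that iterates over the destination's lane count reads past the source buffer.

```cpp
#include <cstdint>

constexpr int kFloatLanes = 8;  // floats in one 256-bit vector (assumed)
constexpr int kInt64Lanes = 4;  // int64s in one 256-bit vector (assumed)

// Buggy pattern: the loop bound uses the destination lane count, so it
// reads src[4..7], past the end of a 4-lane int64 buffer (undefined
// behavior that surfaces as garbage values in downstream tests).
void cvt_int64_to_fp32_buggy(const int64_t* src /* kInt64Lanes */,
                             float* dst /* kFloatLanes */) {
  for (int i = 0; i < kFloatLanes; ++i) {  // out-of-bounds read of src
    dst[i] = static_cast<float>(src[i]);
  }
}

// Fixed pattern: one float vector's worth of lanes is filled from two
// int64 vectors, so every index stays inside its own buffer.
void cvt_int64_to_fp32_fixed(const int64_t* src_lo /* kInt64Lanes */,
                             const int64_t* src_hi /* kInt64Lanes */,
                             float* dst /* kFloatLanes */) {
  for (int i = 0; i < kInt64Lanes; ++i) {
    dst[i] = static_cast<float>(src_lo[i]);
    dst[kInt64Lanes + i] = static_cast<float>(src_hi[i]);
  }
}
```

On x86 the out-of-bounds lanes happened not to corrupt results in this test, which is why the bug only surfaced once vectorized compilation was exercised on ARM.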
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122511
Note: links to docs will display an error until the docs builds have completed.
✅ No failures as of commit 78a6f6c with merge base 52e9049.

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
This started as a re-land of #105590 focused on enabling it on MacOS, but quickly turned into landing only very limited platform-specific acceleration at this time (i.e., this PR does not add any NEON-accelerated code at all; it just enables vectorized compilation for the existing abstractions). Enabling the test harness uncovered a number of latent issues in the CPU inductor that were fixed in the following PRs:
- #122511
- #122513
- #122580
- #122608

The following was added/changed to enable vectorized code to work on MacOS:
- Added a VecNEON class to `_inductor/codecache.py` that is supported on all Apple Silicon Macs
- Added `Vectorized::loadu_one_fourth` to `vec_base.h`, limited to 8-bit types (see the sketch after this comment)
- Changed the 64-bit integral type mappings to `int64_t`/`uint64_t` to align with the rest of the code, as on MacOS `int64_t` is a `long long` rather than a `long` (see #118149 for more details)

See the table below for perf changes with and without torch.compile using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on an M2 Pro:

| dtype | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16 | 120 tokens/sec | 130 tokens/sec | 156 tokens/sec |
| float32 | 158 tokens/sec | 140 tokens/sec | 236 tokens/sec |
| float16 | 235 tokens/sec | 81 tokens/sec | 58 tokens/sec |

Pull Request resolved: #122217
Approved by: https://github.com/jansel
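A hedged sketch of what a `loadu_one_fourth`-style helper does: load only the first quarter of a vector's lanes from memory and leave the rest zero, so narrow tails never read past the source buffer. The `Vec` type, lane count, and method names below are illustrative stand-ins, not the actual `vec_base.h` implementation; the trailing `static_assert` illustrates the `int64_t`-mapping point from the list above.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Generic fixed-width vector stand-in; kLanes would be 32 for 8-bit
// lanes in a 256-bit vector.
template <typename T, int kLanes>
struct Vec {
  std::array<T, kLanes> values{};  // lanes default to zero

  // Load `count` elements from memory; remaining lanes stay zero.
  static Vec loadu(const void* ptr, int count) {
    Vec v;
    std::memcpy(v.values.data(), ptr, count * sizeof(T));
    return v;
  }

  // Load only the first quarter of the lanes, restricted to 8-bit
  // element types as in the PR description above.
  static Vec loadu_one_fourth(const void* ptr) {
    static_assert(sizeof(T) == 1, "only supported for 8-bit types");
    return loadu(ptr, kLanes / 4);
  }
};

#ifdef __APPLE__
// On MacOS int64_t is `long long` rather than `long`, which is why the
// 64-bit integral mappings were switched to int64_t/uint64_t (#118149).
static_assert(std::is_same_v<int64_t, long long>,
              "MacOS defines int64_t as long long");
#endif
```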
[CPUInductor] Fix out-of-bounds read/write in cvt_int64_to_[fp32|int32] (#122511)

Discovered while debugging regressions while enabling vectorization on the ARM platform. Without this change, `test_div2_cpu` fails with invalid values on non-x86 CPUs.

Pull Request resolved: #122511
Approved by: https://github.com/peterbell10, https://github.com/jansel
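The `int32` variant named in the title follows the same buffer-sizing rule as the `fp32` sketch earlier: one destination vector's worth of lanes comes from two source vectors, so each index stays inside its own buffer. Lane counts and the signature are again illustrative assumptions, not the actual helpers.

```cpp
#include <cstdint>

constexpr int kInt64Lanes = 4;  // int64s in one 256-bit vector (assumed)

// int64 -> int32 narrowing: 8 int32 lanes are produced from two 4-lane
// int64 vectors, so no loop index exceeds its source buffer.
void cvt_int64_to_int32_fixed(const int64_t* src_lo, const int64_t* src_hi,
                              int32_t* dst /* 2 * kInt64Lanes */) {
  for (int i = 0; i < kInt64Lanes; ++i) {
    dst[i] = static_cast<int32_t>(src_lo[i]);                // low half
    dst[kInt64Lanes + i] = static_cast<int32_t>(src_hi[i]);  // high half
  }
}
```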
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang