Skip to content

Conversation

@malfet
Copy link
Contributor

@malfet malfet commented Mar 22, 2024

Discovered while debugging regressions in enabling vectorization on ARM platform

Without this change test_div2_cpu will fail with invalid values on non-x86 CPU

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

Discovered while debugging regressions in enabling vectorization on ARM platform
@pytorch-bot
Copy link

pytorch-bot bot commented Mar 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122511

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 78a6f6c with merge base 52e9049 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@malfet malfet requested a review from albanD March 22, 2024 20:46
@malfet malfet added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 22, 2024
@malfet malfet changed the title [CPUInductor] Fix out-of-bounds read/write in cvt_int64_to_fp32 [CPUInductor] Fix out-of-bounds read/write in cvt_int64_to_[fp32|int32] Mar 22, 2024
@malfet
Copy link
Contributor Author

malfet commented Mar 23, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request Mar 26, 2024
This started as a re-land of #105590 but focusing on enabling it on MacOS, but quickly turned into landing very limited platform-specific acceleration at this time (I.e. this PR does not add any NEON accelerated code at all, just enables vectorized compilation for the existing abstractions)

Enabling the test harness, uncovered number of latent issues in CPU inductor that were fixed in the following PRS:
- #122511
- #122513
- #122580
- #122608

Following was added/changed to enable vectorization code to work on MacOS
 - Added VecNEON class to `_inductor/codecache.py`  that is supported on all AppleSilicon Macs
 - Added `Vectorized::loadu_one_fourth` to `vec_base.h`, and limit it to 8-bit types
 - Change 64-bit integral types mapping to `int64_t`/`uint64_t` to align with the rest of the code, as on MacOS, `int64_t` is a `long long` rather than `long` (see #118149 for more details)

See table below for perf changes with and without torch.compile using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on M2 Pro:
| dtype  | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16  | 120 tokens/sec  | 130 tokens/sec | 156 tokens/sec |
| float32  | 158 tokens/sec  | 140 tokens/sec | 236 tokens/sec |
| float16  | 235 tokens/sec  | 81 tokens/sec | 58 tokens/sec |

Pull Request resolved: #122217
Approved by: https://github.com/jansel
pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
…2] (#122511)

Discovered while debugging regressions in enabling vectorization on ARM platform

Without this change `test_div2_cpu` will fail with invalid values on non-x86 CPU

Pull Request resolved: #122511
Approved by: https://github.com/peterbell10, https://github.com/jansel
pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
This started as a re-land of #105590 but focusing on enabling it on MacOS, but quickly turned into landing very limited platform-specific acceleration at this time (I.e. this PR does not add any NEON accelerated code at all, just enables vectorized compilation for the existing abstractions)

Enabling the test harness, uncovered number of latent issues in CPU inductor that were fixed in the following PRS:
- #122511
- #122513
- #122580
- #122608

Following was added/changed to enable vectorization code to work on MacOS
 - Added VecNEON class to `_inductor/codecache.py`  that is supported on all AppleSilicon Macs
 - Added `Vectorized::loadu_one_fourth` to `vec_base.h`, and limit it to 8-bit types
 - Change 64-bit integral types mapping to `int64_t`/`uint64_t` to align with the rest of the code, as on MacOS, `int64_t` is a `long long` rather than `long` (see #118149 for more details)

See table below for perf changes with and without torch.compile using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on M2 Pro:
| dtype  | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16  | 120 tokens/sec  | 130 tokens/sec | 156 tokens/sec |
| float32  | 158 tokens/sec  | 140 tokens/sec | 236 tokens/sec |
| float16  | 235 tokens/sec  | 81 tokens/sec | 58 tokens/sec |

Pull Request resolved: #122217
Approved by: https://github.com/jansel
@github-actions github-actions bot deleted the malfet-patch-9 branch April 23, 2024 03:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants