Add NEON ISA support on aarch64 #123584
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/123584
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures as of commit 5274dd7 with merge base 868e5ce.
NEW FAILURE - The following job has failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
jgong5 left a comment:
How do we test this? Any UT covers this?
@jgong5 thanks for the review. The existing UT suite covers this. The two tests:
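For illustration only, a minimal compiled-vs-eager parity check in this spirit (not the actual unit tests referenced above; the op, shape, and tolerance here are assumptions) can be run on an aarch64 host:

```python
import torch
import torch.nn as nn

# Illustrative compiled-vs-eager parity check (not the actual unit tests referenced above).
torch.manual_seed(0)
x = torch.randn(1024, 1024)

softmax = nn.Softmax(dim=-1).eval()
compiled_softmax = torch.compile(softmax)

with torch.no_grad():
    eager_out = softmax(x)
    compiled_out = compiled_softmax(x)

# The compiled (vectorized) kernel should stay numerically close to eager mode.
print(torch.allclose(eager_out, compiled_out, atol=1e-6))
```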
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased a3d9027 to 98099d0.
@pytorchbot merge
Merge failed. Reason: This PR needs a `release notes:` label. If not, please add the `topic: not user facing` label. To add a label, you can comment to pytorchbot, for example `@pytorchbot label "release notes: inductor"`. For more information, see the PyTorch Bot wiki. Details for Dev Infra team: raised by workflow job.
@pytorchbot label "release notes: inductor"
@pytorchbot merge -f "Lint is green and it was green before"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use `-f` only as a last resort. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #104729

This improves the compiled-mode performance of Softmax (by 20%) and of other operations (like batchnorm) that invoke the `reduce_all` function. It thereby also improves BERT inference by around 8%.

Tested on a Graviton 3 instance (c7g.4xl). Tests were run in a single-threaded manner. Script attached below.

Command:
`OMP_NUM_THREADS=1 LRU_CACHE_CAPACITY=1024 DNNL_DEFAULT_FPMATH_MODE=BF16 python TestSoftmax.py`

[TestSoftmax.txt](https://github.com/pytorch/pytorch/files/14910754/TestSoftmax.txt)

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Softmax().eval()
compiled_model = torch.compile(model)
inputs = torch.randn(1024, 1024)

with torch.set_grad_enabled(False):
    for _ in range(50):
        compiled_model(inputs)  # Warmup
    print("Warmup over")

    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("model_inference"):
            for _ in range(100):
                compiled_model(inputs)

    print(prof.key_averages().table(sort_by="self_cpu_time_total"))

    # Check that compiled model inference and eager model inference are similar using torch.allclose
    print(torch.allclose(compiled_model(inputs), model(inputs)))
```

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

Co-authored-by: Nikita Shulga <[email protected]>
Pull Request resolved: pytorch#123584
Approved by: https://github.com/jgong5, https://github.com/malfet
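The gains above are attributed to vectorizing `reduce_all`-style horizontal reductions. As a rough illustration of why Softmax in particular benefits (a Python sketch only; the hot path is the C++ `reduce_all` referenced in the description, and the helper name here is made up), a numerically stable row-wise softmax is built from exactly two full-row reductions, a max and a sum:

```python
import torch

def softmax_rowwise(x: torch.Tensor) -> torch.Tensor:
    # Each row needs two full-row ("reduce_all"-style) reductions:
    # 1) a max reduction for numerical stability,
    # 2) a sum reduction to normalize the exponentials.
    row_max = x.max(dim=-1, keepdim=True).values   # reduction #1
    exps = (x - row_max).exp()
    row_sum = exps.sum(dim=-1, keepdim=True)       # reduction #2
    return exps / row_sum

x = torch.randn(1024, 1024)
print(torch.allclose(softmax_rowwise(x), torch.softmax(x, dim=-1), atol=1e-6))
```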