Add NEON ISA support on aarch64 #123584
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/123584
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures as of commit 5274dd7 with merge base 868e5ce.
NEW FAILURE - The following job has failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
jgong5 left a comment:
How do we test this? Any UT covers this?
@jgong5 thanks for the review. The existing UT suite covers this. The two tests:
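For illustration only, a minimal compiled-vs-eager parity check in this spirit (not the actual unit tests referenced above; the op, shape, and tolerance here are assumptions) can be run on an aarch64 host:

```python
import torch
import torch.nn as nn

# Illustrative compiled-vs-eager parity check (not the actual unit tests referenced above).
torch.manual_seed(0)
x = torch.randn(1024, 1024)

softmax = nn.Softmax(dim=-1).eval()
compiled_softmax = torch.compile(softmax)

with torch.no_grad():
    eager_out = softmax(x)
    compiled_out = compiled_softmax(x)

# The compiled (vectorized) kernel should stay numerically close to eager mode.
print(torch.allclose(eager_out, compiled_out, atol=1e-6))
```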
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased a3d9027 to 98099d0.
@pytorchbot merge
Merge failed. Reason: This PR needs a `release notes:` label. If not, please add the `topic: not user facing` label. To add a label, you can comment to pytorchbot, for example `@pytorchbot label "release notes: inductor"`. For more information, see the PyTorch Bot wiki. Details for Dev Infra team: raised by workflow job.
@pytorchbot label "release notes: inductor"
@pytorchbot merge -f "Lint is green and it was green before"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use `-f` only as a last resort. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #104729

This improves the compiled-mode performance of Softmax (by 20%) and of other operations (like batchnorm) that invoke the `reduce_all` function. It thereby also improves BERT inference by around 8%.

Tested on a Graviton 3 instance (c7g.4xl). Tests were run in a single-threaded manner. Script attached below.

Command:
`OMP_NUM_THREADS=1 LRU_CACHE_CAPACITY=1024 DNNL_DEFAULT_FPMATH_MODE=BF16 python TestSoftmax.py`

[TestSoftmax.txt](https://github.com/pytorch/pytorch/files/14910754/TestSoftmax.txt)

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Softmax().eval()
compiled_model = torch.compile(model)
inputs = torch.randn(1024, 1024)

with torch.set_grad_enabled(False):
    for _ in range(50):
        compiled_model(inputs)  # Warmup
    print("Warmup over")

    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("model_inference"):
            for _ in range(100):
                compiled_model(inputs)

    print(prof.key_averages().table(sort_by="self_cpu_time_total"))

    # Check that compiled model inference and eager model inference are similar using torch.allclose
    print(torch.allclose(compiled_model(inputs), model(inputs)))
```

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

Co-authored-by: Nikita Shulga <[email protected]>
Pull Request resolved: pytorch#123584
Approved by: https://github.com/jgong5, https://github.com/malfet
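The gains above are attributed to vectorizing `reduce_all`-style horizontal reductions. As a rough illustration of why Softmax in particular benefits (a Python sketch only; the hot path is the C++ `reduce_all` referenced in the description, and the helper name here is made up), a numerically stable row-wise softmax is built from exactly two full-row reductions, a max and a sum:

```python
import torch

def softmax_rowwise(x: torch.Tensor) -> torch.Tensor:
    # Each row needs two full-row ("reduce_all"-style) reductions:
    # 1) a max reduction for numerical stability,
    # 2) a sum reduction to normalize the exponentials.
    row_max = x.max(dim=-1, keepdim=True).values   # reduction #1
    exps = (x - row_max).exp()
    row_sum = exps.sum(dim=-1, keepdim=True)       # reduction #2
    return exps / row_sum

x = torch.randn(1024, 1024)
print(torch.allclose(softmax_rowwise(x), torch.softmax(x, dim=-1), atol=1e-6))
```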