Pass ideep::lowp_kind to matmul_forward::compute on cache misses #135058
Conversation
This fixes an issue for aarch64 where, on a cache miss (e.g. if input dimensions change), ideep::matmul_forward::compute runs with the default lowp_kind (u8s8), which is not supported by oneDNN+ACL, causing the workload to fall back to a much slower oneDNN gemm:jit kernel.
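For anyone wanting to confirm which kernel a given input shape actually hits, a small sketch using oneDNN's standard ONEDNN_VERBOSE logging (assuming a build where dynamic quantized linear is routed through ideep/oneDNN, as on aarch64 per this description; sizes and the exact log strings are illustrative):

```python
# Sketch: ONEDNN_VERBOSE must be set before torch initializes oneDNN.
import os
os.environ["ONEDNN_VERBOSE"] = "1"  # print one log line per oneDNN primitive executed

import torch

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096, bias=False))
model = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear})

with torch.no_grad():
    model(torch.randn(32, 4096))  # first shape is cached: log should show the ACL lowp_gemm
    model(torch.randn(16, 4096))  # new shape, cache miss: without this fix the log
                                  # shows gemm:jit here instead of lowp_gemm
```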
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135058
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure
As of commit 511af4e with merge base e7731b3.
NEW FAILURE - The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "module: arm"
jondea left a comment
Great find, thank you!
@pytorchbot label "ciflow/linux-aarch64"
Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.
@pytorchbot label "ciflow/linux-aarch64"
Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.
@malfet We can't seem to add CI labels
@pytorchbot merge
Merge failed
Reason: Approvers from one of the following sets are needed:
malfet left a comment
Thank you for the detailed PR description. My only question: will it still work (although slowly) on older ARMv8 platforms like the Cortex-A75?
Yes, it should work for older Arm platforms too.
@pytorchbot revert -m "It regresses x86 performance" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
…es (#135058)" This reverts commit 3d24313. Reverted #135058 on behalf of https://github.com/malfet due to It regresses x86 performance ([comment](#135058 (comment)))
@fadara01 your PR has been successfully reverted.
@malfet could we please get a reproducer for the regression on x86?
Sorry, this is an internal test; I cannot share the full reproducer.
I'm not very familiar with this code, but wasn't it calling https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3/include/ideep/operators/matmul.hpp#L63 before your PR?
…rch#135058)

Optimized dynamic quantization for aarch64 was enabled by pytorch#126687 and pytorch#134897

This PR fixes an issue for aarch64 where on a [cache miss](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp#L592) (e.g. if input dimensions change) [ideep::matmul_forward::compute](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L160) (wrongly) runs with the [default lowp_kind (u8s8)](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L174), which is not supported by oneDNN+ACL (Arm Compute Library), causing the workload to fall back to a much slower oneDNN gemm:jit kernel

Example:

```python
import torch

DIM = 4096
INPUT_SIZE1 = 32
INPUT_SIZE2 = 16

class LinearNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(DIM, DIM, bias=False)

    def forward(self, x):
        x = self.fc1(x)
        return x

input1 = torch.randn(size=(INPUT_SIZE1, DIM))
input2 = torch.randn(size=(INPUT_SIZE2, DIM))

with torch.no_grad():
    model = LinearNet()
    model = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear})
    model(input1)  # this goes to ACL lowp_gemm
    print("=" * 50)
    model(input2)  # this goes to gemm:jit without this PR, and to ACL with this PR
```

In the code snippet above:
- The matmul from `model(input1)` goes to oneDNN+ACL (in both cases, with and without the PR)
- The matmul from `model(input2)`: **Without this PR**, there's a cache miss (different input shapes) and matmul_forward::compute is run with the default lowp_kind (u8s8); hence the matmul falls back to gemm:jit in oneDNN. **With this PR**, the matmul goes to oneDNN+ACL, which is around 10x faster than oneDNN+jit.

Pull Request resolved: pytorch#135058
Approved by: https://github.com/jondea, https://github.com/malfet
@malfet I don't think this is the case, as the ideep::matmul_forward::compute function you mentioned does not match the arguments we're passing from qlinear_dynamic.cpp in the cache-miss (else) branch.

Are you using a vanilla wheel for your internal tests with x86? If I pip install a wheel on x86, this path (ideep/oneDNN) is not even selected for dynamic quantization (this can be deduced from the lack of oneDNN verbose output when running with the environment variable ONEDNN_VERBOSE=1).

@Chao1Han do you have any insights on why this might cause regressions on x86? I'm happy to wrap the new arguments in an aarch64-only `#ifdef` if needed.
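As a quick sanity check of which quantized backend a given wheel actually dispatches to, torch's standard backend introspection can be inspected; a minimal sketch (the engine names printed depend on how the wheel was built, so the values in the comments are illustrative):

```python
import torch

# Quantized engines this build was compiled with, and the one currently selected.
print(torch.backends.quantized.supported_engines)  # e.g. ['none', 'onednn', 'x86', 'fbgemm']
print(torch.backends.quantized.engine)             # e.g. 'x86' on a stock x86 wheel
```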
@malfet, any updates or thoughts?
@malfet, I built torch on x86 with … Could you please have a more serious look at this?
Raised an issue that we have seen which will be fixed by this change: #145216. Not a regression; those tests had not been enabled in CI before.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Rebase failed due to …
I think the rebase failed because this PR has been previously merged and subsequently reverted.
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`.
Optimized dynamic quantization for aarch64 was enabled by #126687 and #134897
This PR fixes an issue for aarch64 where on a cache miss (e.g. if input dimensions change) ideep::matmul_forward::compute (wrongly) runs with the default lowp_kind (u8s8), which is not supported by oneDNN+ACL (Arm Compute Library), causing the workload to fall back to a much slower oneDNN gemm:jit kernel.
Example:
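```python
import torch

DIM = 4096
INPUT_SIZE1 = 32
INPUT_SIZE2 = 16

class LinearNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(DIM, DIM, bias=False)

    def forward(self, x):
        x = self.fc1(x)
        return x

input1 = torch.randn(size=(INPUT_SIZE1, DIM))
input2 = torch.randn(size=(INPUT_SIZE2, DIM))

with torch.no_grad():
    model = LinearNet()
    model = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear})
    model(input1)  # this goes to ACL lowp_gemm
    print("=" * 50)
    model(input2)  # this goes to gemm:jit without this PR, and to ACL with this PR
```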
In the code snippet above:
- The matmul from `model(input1)` goes to oneDNN+ACL (in both cases, with and without the PR)
- The matmul from `model(input2)`: **Without this PR**, there's a cache miss (different input shapes) and matmul_forward::compute is run with the default lowp_kind (u8s8); hence the matmul falls back to gemm:jit in oneDNN. **With this PR**, the matmul goes to oneDNN+ACL, which is around 10x faster than oneDNN+jit. A rough timing sketch follows below.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm
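To put a rough number on the ~10x gap described above, a timing sketch (sizes and iteration count are illustrative; the measured ratio depends on the machine and on whether the build dispatches to oneDNN+ACL at all):

```python
import time
import torch

DIM = 4096
model = torch.nn.Sequential(torch.nn.Linear(DIM, DIM, bias=False))
model = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear})

def bench(x, iters=50):
    with torch.no_grad():
        model(x)  # one untimed call to exclude one-off setup costs
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        return (time.perf_counter() - start) / iters

x1 = torch.randn(32, DIM)  # first shape seen: populates the primitive cache
x2 = torch.randn(16, DIM)  # different shape: takes the cache-miss path on every call
print(f"cached shape   : {bench(x1) * 1e3:.2f} ms/iter")
print(f"uncached shape : {bench(x2) * 1e3:.2f} ms/iter")
```

Without the fix, the second line should be markedly slower on an aarch64 oneDNN+ACL build, since every call with the uncached shape falls back to gemm:jit; with the fix, both shapes stay on the ACL path.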