
Conversation


@fadara01 fadara01 commented Sep 3, 2024

Optimized dynamic quantization for aarch64 was enabled by #126687 and #134897

This PR fixes an issue on aarch64 where, on a [cache miss](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp#L592) (e.g. when the input dimensions change), [ideep::matmul_forward::compute](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L160) (wrongly) runs with the [default lowp_kind (u8s8)](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L174), which is not supported by oneDNN+ACL (Arm Compute Library). This causes the workload to fall back to a much slower oneDNN gemm:jit kernel.

Example:

```python
import torch

DIM = 4096
INPUT_SIZE1 = 32
INPUT_SIZE2 = 16

class LinearNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(DIM, DIM, bias=False)

    def forward(self, x):
        x = self.fc1(x)
        return x

input1 = torch.randn(size=(INPUT_SIZE1, DIM))
input2 = torch.randn(size=(INPUT_SIZE2, DIM))

with torch.no_grad():
    model = LinearNet()
    model = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear})

    model(input1)   # this goes to ACL lowp_gemm
    print("=" * 50)
    model(input2)   # this goes to gemm:jit without this PR, and to ACL with this PR
```

In the code snippet above:

  • The matmul from model(input1) goes to oneDNN+ACL (in both cases, with and without the PR).
  • The matmul from model(input2): without this PR, there is a cache miss (the input shapes differ) and matmul_forward::compute runs with the default lowp_kind (u8s8), so the matmul falls back to gemm:jit in oneDNN. With this PR, the matmul goes to oneDNN+ACL, which is around 10x faster than oneDNN+jit. One way to check which kernel is dispatched is oneDNN's verbose output, as sketched below.
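
The sketch below is not part of the PR; it assumes an aarch64 PyTorch build where dynamic quantization dispatches to oneDNN+ACL, and that ONEDNN_VERBOSE is picked up when set before torch is imported (setting it in the shell, e.g. ONEDNN_VERBOSE=1 python repro.py, works just as well):

```python
# Hypothetical verification script, not part of this PR: enable oneDNN's verbose
# mode to see which kernel handles each dynamically quantized matmul.
import os
os.environ.setdefault("ONEDNN_VERBOSE", "1")  # set before the first oneDNN primitive runs

import torch

class LinearNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(4096, 4096, bias=False)

    def forward(self, x):
        return self.fc1(x)

with torch.no_grad():
    model = torch.ao.quantization.quantize_dynamic(LinearNet(), {torch.nn.Linear})
    model(torch.randn(32, 4096))  # expect a verbose line for an ACL lowp_gemm kernel
    model(torch.randn(16, 4096))  # without this PR: gemm:jit; with this PR: ACL again
```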

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm

This fixes an issue for aarch64 where on a cache miss (e.g. if input dimensions change)
ideep::matmul_forward::compute runs with the default lowp_kind (u8s8) which
is not supported by oneDNN+ACL, causing the workload to fall back to a much slower
oneDNN gemm:jit kernel

pytorch-bot bot commented Sep 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135058

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 511af4e with merge base e7731b3:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


linux-foundation-easycla bot commented Sep 3, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: fadara01 / name: Fadi Arafeh (511af4e)

@pytorch-bot pytorch-bot bot added module: cpu CPU specific problem (e.g., perf, algorithm) release notes: quantization release notes category labels Sep 3, 2024

fadara01 commented Sep 3, 2024

cc @malfet @atalman @jondea @cfRod @milpuz01 Please kindly review.


fadara01 commented Sep 3, 2024

@pytorchbot label "module: arm"

@pytorch-bot pytorch-bot bot added the module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 label Sep 3, 2024

@jondea jondea left a comment


Great find, thank you!

@soulitzer soulitzer added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Sep 6, 2024
@fadara01

@pytorchbot label "ciflow/linux-aarch64"


pytorch-bot bot commented Sep 12, 2024

Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.


cfRod commented Sep 12, 2024

@pytorchbot label "ciflow/linux-aarch64"


pytorch-bot bot commented Sep 12, 2024

Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.


cfRod commented Sep 12, 2024

@malfet We can't seem to add CI labels


malfet commented Sep 12, 2024

@malfet We can't seem to add CI labels

@cfRod you'll need to approve the workflow run first.

@malfet malfet added topic: bug fixes topic category ciflow/linux-aarch64 linux aarch64 CI workflow labels Sep 12, 2024

malfet commented Sep 12, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 12, 2024
@pytorchmergebot

Merge failed

Reason: Approvers from one of the following sets are needed:

  • CPU ATen backend (mingfeima, XiaobingSuper, jgong5, vfdev-5, leslie-fang-intel)
  • CPU inductor (leslie-fang-intel, jgong5, EikanWang)
  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)
Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers


@malfet malfet left a comment


Thank you for the detailed PR description. My only question: will it still work (although slowly) on older ARMv8 platforms like Cortex A75?

@fadara01

@malfet Thank you for the detailed PR description. My only question: will it still work (although slowly) on older ARMv8 platforms like Cortex A75?

Yes, it should work for older Arm platforms too.


malfet commented Sep 13, 2024

@pytorchbot revert -m "It regresses x86 performance" -c nosignal

@pytorchmergebot

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Sep 13, 2024
@pytorchmergebot

@fadara01 your PR has been successfully reverted.


fadara01 commented Sep 17, 2024

@malfet could we please get a reproducer for the regression on x86?
My understanding is that ideep::matmul_forward::compute will be called with the same default arguments on x86 with and without this PR.


malfet commented Sep 18, 2024

could we please get a reproducer for the regression on x86?

Sorry, this is an internal test, I can not share full reproducer.

My understanding is that ideep::matmul_forward::compute will be called with the same default arguments on x86 with and without this PR.

I'm not very familiar with this code, but it was calling https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3/include/ideep/operators/matmul.hpp#L63 before your PR, wasn't it?

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
…rch#135058)

Pull Request resolved: pytorch#135058
Approved by: https://github.com/jondea, https://github.com/malfet
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024

fadara01 commented Nov 4, 2024

I'm not very familiar with this code, but it was calling https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3/include/ideep/operators/matmul.hpp#L63 before your PR, wasn't it?

@malfet I don't think this is the case, as the ideep::matmul_forward::compute function you mentioned does not match the arguments we're passing from qlinear_dynamic.cpp in the cache miss (else) branch.

Are you using a vanilla wheel for your internal tests with x86?

If I pip install a wheel on x86, this path (ideep/oneDNN) is not even selected for dynamic quantization (this can be deduced from the lack of oneDNN verbose output when running with the environment variable ONEDNN_VERBOSE=1).
I think the fbgemm path is chosen instead; a quick way to check the active quantized engine is sketched below.
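
A minimal sketch (not part of this PR) for checking which quantized engine a given build selects; the exact set of engines reported is an assumption and will differ between builds:

```python
# Hypothetical check, not part of this PR: list the quantized engines a build supports
# and the one that is currently selected.
import torch

print(torch.backends.quantized.supported_engines)  # e.g. includes 'fbgemm'/'x86' on x86 wheels
print(torch.backends.quantized.engine)             # the currently selected engine
```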

@Chao1Han do you have any insights on why this might cause regressions on x86?

I'm happy to wrap the new arguments with an #ifdef __aarch64__ but I still can't understand why that is necessary.

@fadara01

@malfet, any updates or thoughts?


fadara01 commented Dec 28, 2024

@malfet, I built torch on x86 with USE_FBGEMM=0 to start exercising this ideep path and confirmed with print statements that this function gets called before and after my PR. Hence, I do not understand how this is causing regressions.

Could you please have a more serious look at this?
It brings 10x speedups for LLMs on aarch64, since all matmuls after prefill are currently getting dispatched to sub-optimal implementations due to the [wrong] default u8s8 lowp_kind.
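
For illustration only (not part of the PR, and the shapes are made up): in an LLM-style workload, the dynamically quantized linear is first hit with a prefill-sized input and then with decode-sized inputs, which is exactly the shape change that lands on the cache-miss path described above:

```python
# Illustrative sketch, not part of this PR: the first (prefill) call is served with the
# prompt-length shape, so the decode-shaped calls that follow take the cache-miss path.
import torch

hidden = 4096
mlp = torch.nn.Sequential(torch.nn.Linear(hidden, hidden, bias=False))
mlp = torch.ao.quantization.quantize_dynamic(mlp, {torch.nn.Linear})

with torch.no_grad():
    prompt_len = 32
    mlp(torch.randn(prompt_len, hidden))  # prefill: one matmul over the whole prompt
    for _ in range(4):
        mlp(torch.randn(1, hidden))       # decode: shape differs from prefill on every step
```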

@robert-hardwick

Raised an issue that we have seen, which will be fixed by this change: #145216

Not a regression: those tests had not previously been enabled in CI.

@robert-hardwick

@pytorchbot rebase

@pytorchmergebot

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot

Rebase failed due to

Aborting rebase because rebasing the branch resulted in the same sha as the target branch.
This usually happens because the PR has already been merged.  Please rebase locally and push.

Raised by https://github.com/pytorch/pytorch/actions/runs/12928955569

@fadara01 fadara01 requested a review from malfet January 23, 2025 15:25
@robert-hardwick

I think the rebase failed because this PR was previously merged and subsequently reverted.

@github-actions

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Mar 24, 2025
@github-actions github-actions bot closed this Apr 23, 2025

Labels

  • ciflow/linux-aarch64 (linux aarch64 CI workflow)
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • Merged
  • module: arm (Related to ARM architectures builds of PyTorch. Includes Apple M1)
  • module: cpu (CPU specific problem (e.g., perf, algorithm))
  • open source
  • release notes: quantization (release notes category)
  • Reverted
  • Stale
  • topic: bug fixes (topic category)
  • triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
