Conversation

@yanbing-j (Collaborator) commented Aug 31, 2024

This PR is per an Arm request, tracked in intel/ideep#334.

Context for the request: the Arm team has upstreamed the dynamic quantization changes and all the PRs were merged (torch, ideep, oneDNN), but without this ideep submodule update the feature will not work. The change is isolated to the matmul operator and the quantization path.
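
For readers who want to see the path this update enables, here is a minimal, hedged sketch of dynamic quantization of a linear layer; the module and shapes are illustrative assumptions, not code from this PR. On aarch64 with this update, the quantized matmul should dispatch through ideep/oneDNN:

```python
# Minimal sketch (illustrative assumptions, not code from this PR) of the
# dynamic quantization path this ideep update affects on aarch64.
import torch

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024, bias=False))
model.eval()

# Dynamic quantization: weights are quantized ahead of time, activations
# are quantized on the fly at inference time.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = qmodel(torch.randn(8, 1024))
print(out.shape)  # torch.Size([8, 1024])
```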

cc @gujinghui @PenghuiCheng @XiaobingSuper @jianyuh @jgong5 @mingfeima @sanchitintel @ashokei @jingxu10 @min-jean-cho @Guobing-Chen @Xia-Weiwen @snadampal @malfet @milpuz01

@pytorch-bot bot commented Aug 31, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134897

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit eee651e with merge base 85fa019:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added labels ciflow/linux-aarch64 (linux aarch64 CI workflow), module: mkldnn (Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration), topic: not user facing (topic category) Aug 31, 2024
@yanbing-j added labels ciflow/trunk (Trigger trunk jobs on your pull request), intel (This tag is for PRs from Intel) Aug 31, 2024
@yanbing-j self-assigned this Aug 31, 2024
@milpuz01 (Contributor) commented Sep 2, 2024

@pytorchbot label "arm"

@pytorch-bot bot commented Sep 2, 2024

Didn't find following labels among repository labels: arm

@milpuz01 (Contributor) commented Sep 2, 2024

@pytorchbot label "module: arm"

@pytorch-bot bot added the module: arm (Related to ARM architecture builds of PyTorch. Includes Apple M1) label Sep 2, 2024
@yanbing-j yanbing-j marked this pull request as ready for review September 3, 2024 01:03
@yanbing-j yanbing-j requested review from atalman and malfet September 3, 2024 01:04
@colesbury added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Sep 3, 2024
@yanbing-j (Collaborator, Author) commented

Hi @snadampal @milpuz01, please provide the Arm test results.
Hi @malfet @atalman, please kindly review.

@yanbing-j (Collaborator, Author) commented

@pytorchbot rebase

@pytorchmergebot (Collaborator) commented

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator) commented

Successfully rebased yanbing/update_ideep onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout yanbing/update_ideep && git pull --rebase)

@fadara01 (Collaborator) commented Sep 4, 2024

Our acceptance tests currently do not include dynamic quantization.
Given that this change only affects the dynamic quantization path and does not change the oneDNN version, running the full acceptance tests is not required.

I manually verified this PR and can confirm that, as expected, oneDNN calls Arm Compute Library's optimized lowp GEMM kernels. On 16 Neoverse-V1 cores, the speedup for bert-large is as follows:

| Context length | bert-large speedup with this PR |
|---------------:|--------------------------------:|
| 8   | 20.5x |
| 16  | 26.9x |
| 32  | 31.1x |
| 64  | 40.7x |
| 128 | 53.1x |
| 256 | 50.0x |
| 512 | 27.2x |
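
For reference, a hedged sketch of how speedups like these could be reproduced. The comment does not state its baseline or harness; the bert-large-sized encoder layer, thread count, iteration counts, and the fp32-vs-dynamically-quantized comparison below are all my assumptions:

```python
# Rough timing harness (assumed setup, not the exact harness behind the
# table above). A bert-large-sized TransformerEncoderLayer (hidden 1024,
# 16 heads, FFN 4096) stands in for the full model.
import time
import torch

torch.set_num_threads(16)  # e.g. 16 Neoverse-V1 cores

def bench(model, x, iters=20):
    with torch.no_grad():
        for _ in range(3):  # warm-up; also fills the primitive cache
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        return (time.perf_counter() - start) / iters

layer = torch.nn.TransformerEncoderLayer(
    d_model=1024, nhead=16, dim_feedforward=4096, batch_first=True
).eval()
qlayer = torch.ao.quantization.quantize_dynamic(
    layer, {torch.nn.Linear}, dtype=torch.qint8
)

for seq_len in (8, 16, 32, 64, 128, 256, 512):
    x = torch.randn(1, seq_len, 1024)
    print(f"context {seq_len}: {bench(layer, x) / bench(qlayer, x):.1f}x")
```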

I think the manual tests above and the CI tests are enough for us to merge this PR.

cc: @snadampal @milpuz01 @malfet @atalman @yanbing-j

@snadampal (Collaborator) commented

@fadara01 thanks for the data! The two CI failures don't seem to be related to this PR.

@yanbing-j (Collaborator, Author) commented

@fadara01 Thanks for the data!

@malfet @atalman Could you please help review this PR? Thanks!

@atalman (Contributor) left a comment

lgtm

@snadampal snadampal self-requested a review September 6, 2024 14:54
@snadampal (Collaborator) left a comment

looks good to me.

@atalman (Contributor) commented Sep 6, 2024

@pytorchmergebot merge -f "failures are not related"

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Sep 12, 2024

Optimized dynamic quantization for aarch64 was enabled by #126687 and #134897

This PR fixes an issue for aarch64 where, on a [cache miss](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp#L592) (e.g. if input dimensions change), [ideep::matmul_forward::compute](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L160) (wrongly) runs with the [default lowp_kind (u8s8)](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L174), which is not supported by oneDNN+ACL (Arm Compute Library). This causes the workload to fall back to a much slower oneDNN gemm:jit kernel.

Example:
```python
import torch

DIM = 4096
INPUT_SIZE1 = 32
INPUT_SIZE2 = 16

class LinearNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(DIM, DIM, bias=False)

    def forward(self, x):
        x = self.fc1(x)
        return x

input1 = torch.randn(size=(INPUT_SIZE1, DIM))
input2 = torch.randn(size=(INPUT_SIZE2, DIM))

with torch.no_grad():
    model = LinearNet()
    model = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear})

    model(input1)   # this goes to ACL lowp_gemm
    print("=" * 50)
    model(input2)   # this goes to gemm:jit without this PR, and to ACL with this PR
```
In the code snippet above:
- The matmul from `model(input1)` goes to oneDNN+ACL in both cases, with and without this PR.
- The matmul from `model(input2)` hits a cache miss (different input shape). **Without this PR**, matmul_forward::compute runs with the default lowp_kind (u8s8), so the matmul falls back to gemm:jit in oneDNN. **With this PR**, the matmul goes to oneDNN+ACL, which is around 10x faster than oneDNN+jit.
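
Continuing the snippet above, a hedged way to observe this difference is to time the two calls; a oneDNN verbose log (e.g. running with the environment variable ONEDNN_VERBOSE=1) should show the dispatched kernel names directly. The helper below is illustrative, not part of this PR:

```python
# Hedged sketch, reusing model/input1/input2 from the snippet above:
# time both shapes; without the fix, input2's shape stays on the much
# slower gemm:jit primitive that was created on its cache miss.
import time

def timed(fn, arg, iters=50):
    fn(arg)  # warm-up (this is where the per-shape cache miss happens)
    start = time.perf_counter()
    for _ in range(iters):
        fn(arg)
    return (time.perf_counter() - start) / iters

with torch.no_grad():
    t1 = timed(model, input1)  # shape hits the ACL lowp_gemm primitive
    t2 = timed(model, input2)  # gemm:jit without this PR, ACL with it
    print(f"input1: {t1 * 1e6:.1f} us/iter, input2: {t2 * 1e6:.1f} us/iter")
```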

Pull Request resolved: #135058
Approved by: https://github.com/jondea, https://github.com/malfet
tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024
tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024