Enable fast qlinear static/dynamic path for AArch64 through ACL directly #148583

fadara01 · 2025-03-05T18:43:34Z

Stack from ghstack (oldest at bottom):

This enables a fast path for eager mode dynamic quantization for AArch64 through Arm Compute Library (ACL) directly.

Context: PR #126687 enabled an optimized implementation of qlinear_dynamic for AArch64 through ideep → oneDNN → ACL, which improved performance by ~10x compared to the previous implementation.
However, the current qlinear_dynamic path (ideep → oneDNN → ACL) suffers from high overhead due to API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (lowp_gemm) API. For example, ACL's lowp_gemm objects cache information such as the weights reduction or the weights in an optimized memory format, which oneDNN does not allow due to its stateless nature.
Hence, ACL currently runs a (redundant) sum of columns and pre-transposition (to the GEMM kernel's optimal format) for each GEMM operation.
This PR addresses these inefficiencies by integrating ACL directly with qlinear_dynamic. This approach yields an average speedup (averaged over context_lengths of 2^3 up to 2^9) of ~50% for bert-base-uncased, bert-large-uncased, roberta-base, and distilbert-base-uncased with 16 threads on a Neoverse-V1 (with transformers==4.48).
To achieve this, we introduce PackedLinearWeightsACL (as a subclass of PackedLinearWeightsOnednn) with an implementation of qlinear_dynamic that uses ACL directly, while qlinear still follows the oneDNN path.
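
To illustrate why the per-call sum of columns is redundant, here is a minimal plain-Python sketch. It is not the ACL, oneDNN, or ATen API: `column_sums`, `lowp_gemm`, `stateless_gemm`, and `PackedWeights` are hypothetical names used only for this illustration. It assumes the usual zero-point expansion of an integer GEMM, in which the weight column sums depend only on the weights and can therefore be computed once and reused.

```python
"""Illustrative sketch only: plain-Python integer GEMM, not the ACL or oneDNN API."""


def column_sums(w):
    """Sum each column of a K x N integer weight matrix (the 'weights reduction')."""
    return [sum(col) for col in zip(*w)]


def lowp_gemm(a, a_zp, w, w_zp, col_sums):
    """Compute sum_k (a[i][k] - a_zp) * (w[k][j] - w_zp) using precomputed column sums.

    Expanding the product gives:
        raw[i][j] - a_zp * col_sums[j] - w_zp * row_sum[i] + K * a_zp * w_zp
    """
    k = len(w)
    cols = list(zip(*w))
    out = []
    for row in a:
        row_sum = sum(row)  # depends on the activations, so it is per-call work
        out.append([
            sum(x * y for x, y in zip(row, col))
            - a_zp * col_sums[j]
            - w_zp * row_sum
            + k * a_zp * w_zp
            for j, col in enumerate(cols)
        ])
    return out


def stateless_gemm(a, a_zp, w, w_zp):
    """Stateless call: nothing survives between calls, so the reduction is redone each time."""
    return lowp_gemm(a, a_zp, w, w_zp, column_sums(w))


class PackedWeights:
    """Stateful holder (hypothetical): the weight reduction is computed once at packing time."""

    def __init__(self, w, w_zp):
        self.w = w
        self.w_zp = w_zp
        self.col_sums = column_sums(w)  # cached across all later gemm() calls

    def gemm(self, a, a_zp):
        return lowp_gemm(a, a_zp, self.w, self.w_zp, self.col_sums)


if __name__ == "__main__":
    weights = [[7, 8], [9, 10], [11, 12]]
    packed = PackedWeights(weights, w_zp=3)
    activations = [[1, 2, 3], [4, 5, 6]]
    # Both paths give the same result; only the packed one reuses the reduction.
    print(packed.gemm(activations, a_zp=2))
    print(stateless_gemm(activations, 2, weights, 3))
```

Both paths produce the same output; the difference is that the stateful object pays for the weight reduction once, which is the kind of caching the description above says the stateless path cannot do.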

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

[ghstack-poisoned]

pytorch-bot · 2025-03-05T18:43:38Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148583

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit fe602ae with merge base 6c3492b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 555097f
Pull Request resolved: #148583

fadara01 · 2025-03-06T08:23:01Z

Sorry, this PR is a mistake, I'm a ghstack newbie

Update

fe602ae

[ghstack-poisoned]

fadara01 requested review from digantdesai, jerryzh168, jianyuh, kimishpatel and salilsdesai as code owners March 5, 2025 18:43

fadara01 mentioned this pull request Mar 5, 2025

Enable Direct Use of Arm Compute Library (ACL) in ATen #148582

Closed

pytorch-bot bot added the module: cpu (CPU specific problem, e.g., perf, algorithm) and release notes: quantization (release notes category) labels Mar 5, 2025

pytorchbot added the open source label Mar 5, 2025

fadara01 changed the title ~~Enable fast qlinear_dynamic path for AArch64 through ACL directly~~ Enable fast qlinear static/dynamic path for AArch64 through ACL directly Mar 5, 2025

fadara01 closed this Mar 6, 2025

github-actions bot deleted the gh/fadara01/3/head branch April 11, 2025 02:31
