fadara01 (Collaborator) commented Mar 5, 2025

Stack from ghstack (oldest at bottom):

This enables a fast path for eager mode static quantization for AArch64 through Arm Compute Library (ACL) directly.

PR #145942 addressed the high overhead in qlinear_dynamic on AArch64 (due to redundant weight pretranspositions and reductions) by enabling a path that calls ACL directly.
This PR does the same for (static) qlinear.
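For context, the sketch below shows the standard eager-mode static quantization flow whose Linear op (qlinear) this PR accelerates. The flow is plain PyTorch; the model and shapes are illustrative, and the sketch picks whatever quantization engine the local build defaults to, since on AArch64 builds the quantized Linear dispatches through the oneDNN/ACL path discussed here.

```python
import torch
import torch.ao.quantization as tq

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()       # float -> quint8 at runtime
        self.fc = torch.nn.Linear(16, 8)  # lowered to a quantized Linear (qlinear)
        self.dequant = tq.DeQuantStub()   # quint8 -> float

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyModel().eval()
# On AArch64 the engine routes Linear through oneDNN -> ACL; here we use
# the default engine of the local build so the sketch runs anywhere.
model.qconfig = tq.get_default_qconfig(torch.backends.quantized.engine)
tq.prepare(model, inplace=True)
model(torch.randn(4, 16))        # calibration pass: observers record ranges
tq.convert(model, inplace=True)  # fc is now a quantized Linear module
out = model(torch.randn(4, 16))
```

With the fast path enabled, this exact user code is unchanged; only the backend implementation behind the quantized Linear differs.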

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

[ghstack-poisoned]

pytorch-bot bot commented Mar 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148586

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 7018fbb with merge base 6c3492b:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: cpu CPU specific problem (e.g., perf, algorithm) release notes: quantization release notes category labels Mar 5, 2025
fadara01 added a commit that referenced this pull request Mar 5, 2025
…tly.

ghstack-source-id: 05435a0
Pull Request resolved: #148586
fadara01 added a commit that referenced this pull request Mar 5, 2025
This enables a fast path for eager mode static/dynamic quantization for AArch64 through Arm Compute Library (ACL) directly.

Context: PRs #126687 and #139887 enabled an optimized implementation of qlinear[_dynamic] for AArch64 through ideep → oneDNN → ACL, which improved performance by ~10x over the previous implementation.
However, the current qlinear[_dynamic] path (ideep → oneDNN → ACL) suffers from high overhead due to the API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (lowp_gemm) API. For example, ACL's lowp_gemm objects cache information such as weight reductions and weights in an optimized memory format; oneDNN cannot exploit such caching due to its stateless nature.
Hence, ACL currently runs a (redundant) sum of columns and a pre-transposition (to the GEMM kernel's optimal format) for each GEMM operation.

This PR addresses the suboptimalities above by introducing PackedLinearWeightsACL (a subclass of PackedLinearWeightsOnednn) with an implementation of qlinear[_dynamic] that calls ACL directly.
ghstack-source-id: 05435a0
Pull Request resolved: #148585

ghstack-source-id: 05435a0
Pull Request resolved: #148586
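To make the stateless-vs-stateful friction described in the commit message concrete, here is a plain-Python sketch (no torch, oneDNN, or ACL; all names are illustrative, not real APIs) of why a stateful GEMM object avoids the redundant per-call weight preparation:

```python
# Counts how many times the expensive weight preparation runs.
PREP_CALLS = 0

def _prepack(w):
    """Pre-transpose the weight and compute its column sums
    (the reduction a quantized GEMM needs for zero-point correction)."""
    global PREP_CALLS
    PREP_CALLS += 1
    w_t = [list(col) for col in zip(*w)]        # pre-transposition
    col_sums = [sum(col) for col in zip(*w)]    # weight reduction
    return w_t, col_sums

def matmul(x, w_t):
    # x: M x K, w_t: N x K (pre-transposed weight) -> M x N
    return [[sum(a * b for a, b in zip(row, wrow)) for wrow in w_t]
            for row in x]

def qgemm_stateless(x, w):
    # A stateless API (like oneDNN's) has nowhere to keep the prepared
    # weight, so the preparation is redone on every call.
    w_t, _col_sums = _prepack(w)
    return matmul(x, w_t)

class QGemmStateful:
    # A stateful object (like ACL's lowp_gemm) prepares the weight once
    # at construction and reuses it for every subsequent GEMM.
    def __init__(self, w):
        self.w_t, self.col_sums = _prepack(w)
    def __call__(self, x):
        return matmul(x, self.w_t)

w = [[1, 2], [3, 4]]      # K x N weight
x = [[1, 0], [0, 1]]      # M x K input (identity, so output == w)
qgemm_stateless(x, w)     # prep work done...
qgemm_stateless(x, w)     # ...and redone
gemm = QGemmStateful(w)   # prep work done once
gemm(x)
gemm(x)                   # reused, no extra prep
```

Holding the prepared weight in a packed-weights object (as PackedLinearWeightsACL does) is what lets the per-call cost drop to the GEMM itself.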

fadara01 commented Mar 6, 2025

Sorry, this got raised by mistake.

@fadara01 fadara01 closed this Mar 6, 2025
@github-actions github-actions bot deleted the gh/fadara01/6/head branch April 11, 2025 02:32