@fadara01 fadara01 commented Mar 5, 2025

Stack from ghstack (oldest at bottom):

This enables a fast path for eager mode dynamic quantization for AArch64 through Arm Compute Library (ACL) directly.

Context: PR #126687 enabled an optimized implementation of qlinear_dynamic for AArch64 through ideep → oneDNN → ACL, which improved performance by ~10x compared to the previous implementation.
However, the current qlinear_dynamic path (ideep → oneDNN → ACL) suffers from high overhead caused by API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (lowp_gemm) API. For example, ACL's lowp_gemm objects cache information such as the weights reduction or the weights in an optimized memory format, which oneDNN's stateless design does not allow.
Hence, ACL currently runs a (redundant) sum of columns and pre-transposition (to the GEMM kernel's optimal format) for each GEMM operation.
This PR addresses these inefficiencies by integrating ACL directly with qlinear_dynamic. This approach yields an average speedup (averaged over context_lengths of 2^3 up to 2^9) of ~50% for bert-base-uncased, bert-large-uncased, roberta-base, and distilbert-base-uncased with 16 threads on a Neoverse-V1 (with transformers==4.48).
To achieve this, we introduce PackedLinearWeightsACL (as a subclass of PackedLinearWeightsOnednn) with an implementation of qlinear_dynamic that uses ACL directly, while qlinear still follows the oneDNN path.
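
To illustrate why caching the weights reduction matters, here is a minimal pure-Python sketch. It is not the ACL, oneDNN, or PyTorch API: `column_sums`, `int_gemm_with_offsets`, `stateless_linear`, and `PackedWeights` are hypothetical names invented for this example. It relies on the standard zero-point expansion of a quantized GEMM, in which the column sums of the weights depend only on the weights and so can be computed once and reused. The pre-transposition step mentioned above is not modelled.

```python
# Illustrative sketch only: pure-Python stand-ins, not the PyTorch/ACL/oneDNN APIs.

def column_sums(w):
    """Sum each column of a K x N integer weight matrix (the 'weights reduction')."""
    return [sum(col) for col in zip(*w)]


def int_gemm_with_offsets(a, w, a_zp, w_zp, w_col_sums):
    """Compute sum_k (a[i][k] - a_zp) * (w[k][j] - w_zp) via the expanded form
    a@w - a_zp * colsum(w) - w_zp * rowsum(a) + K * a_zp * w_zp."""
    k = len(w)
    out = []
    for row in a:
        row_sum = sum(row)
        out.append([
            sum(x * w[kk][j] for kk, x in enumerate(row))
            - a_zp * w_col_sums[j]
            - w_zp * row_sum
            + k * a_zp * w_zp
            for j in range(len(w_col_sums))
        ])
    return out


def stateless_linear(a, w, a_zp, w_zp):
    # Stateless style: the weights reduction is redone on every call.
    return int_gemm_with_offsets(a, w, a_zp, w_zp, column_sums(w))


class PackedWeights:
    """Hypothetical stateful holder: reduces the weights once at construction."""

    def __init__(self, w, w_zp):
        self.w = w
        self.w_zp = w_zp
        self.col_sums = column_sums(w)  # computed once, reused by every call

    def linear(self, a, a_zp):
        return int_gemm_with_offsets(a, self.w, a_zp, self.w_zp, self.col_sums)
```

Both paths return the same result; the stateful one simply avoids repeating the per-call reduction, which is the kind of saving the direct ACL integration is after.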

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added the labels "module: cpu" (CPU specific problem, e.g., perf, algorithm) and "release notes: quantization" (release notes category) Mar 5, 2025
pytorch-bot bot commented Mar 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148583

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit fe602ae with merge base 6c3492b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

fadara01 added a commit that referenced this pull request Mar 5, 2025
ghstack-source-id: 555097f
Pull Request resolved: #148583
@fadara01 fadara01 changed the title from "Enable fast qlinear_dynamic path for AArch64 through ACL directly" to "Enable fast qlinear static/dynamic path for AArch64 through ACL directly" Mar 5, 2025
fadara01 commented Mar 6, 2025

Sorry, this PR is a mistake, I'm a ghstack newbie

@fadara01 fadara01 closed this Mar 6, 2025
@github-actions github-actions bot deleted the gh/fadara01/3/head branch April 11, 2025 02:31

Labels

"module: cpu" (CPU specific problem, e.g., perf, algorithm), "open source", "release notes: quantization" (release notes category)
