Enable fast qlinear_dynamic path for AArch64 through ACL directly #145942
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145942
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit ac5618b with merge base 6c3492b. This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorchbot label "module: arm"
7b44387 to 802e2a6 (compare)

@pytorchbot label "ciflow/linux-aarch64"
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Successfully rebased: 802e2a6 to bcec64d (compare)

a98ea5c to 35c032c (compare)
malfet left a comment
Thank you for the PR, mostly looks good, though if possible, please submit a separate PR that updates the ACL version.
#pragma once

#include <ATen/Config.h>
#if defined(__aarch64__) && AT_MKLDNN_ACL_ENABLED()
Why do you need an arch guard there?
Suggested change:
- #if defined(__aarch64__) && AT_MKLDNN_ACL_ENABLED()
+ #if AT_MKLDNN_ACL_ENABLED()
We do not, I removed it. AT_MKLDNN_ACL_ENABLED is enough
    int64_t // NUM_THREADS
    >;

enum ACLDynamicQuantMatmulCacheKeyIndex {
Suggested change:
- enum ACLDynamicQuantMatmulCacheKeyIndex {
+ enum class ACLDynamicQuantMatmulCacheKeyIndex {
Done!
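For readers following along, here is a minimal sketch of the scoped-enum-as-tuple-index pattern being discussed; the key fields are illustrative (only NUM_THREADS is taken from the snippet above), not copied from the PR:

```cpp
#include <cstddef>
#include <cstdint>
#include <tuple>

// Illustrative cache key: (M, K, N, num_threads). The real key in the PR may
// contain different fields; this sketch only shows the indexing pattern.
using ACLDynamicQuantMatmulCacheKey = std::tuple<int64_t, int64_t, int64_t, int64_t>;

// A scoped enum keeps the index names out of the enclosing namespace and
// blocks accidental implicit conversions to int.
enum class ACLDynamicQuantMatmulCacheKeyIndex : std::size_t {
  M = 0,
  K,
  N,
  NUM_THREADS,
};

inline int64_t num_threads_of(const ACLDynamicQuantMatmulCacheKey& key) {
  // With enum class, an explicit cast is required to use the value as a
  // tuple index, which makes the intent visible at the call site.
  return std::get<static_cast<std::size_t>(
      ACLDynamicQuantMatmulCacheKeyIndex::NUM_THREADS)>(key);
}
```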
  return dim == 2 ? output : output.reshape(output_size);
}

#if defined(__aarch64__) && AT_MKLDNN_ACL_ENABLED()
Why does this need an arch check?
Suggested change:
- #if defined(__aarch64__) && AT_MKLDNN_ACL_ENABLED()
+ #if AT_MKLDNN_ACL_ENABLED()
We do not need it, I removed it. AT_MKLDNN_ACL_ENABLED is enough
if (with_bias) {
  bia_tensor.allocator()->free();
}
Wouldn't it be better to express something like that by defining bias tensor as std::optional<arm_compute::Tensor> bias_tensor;?
Done, I now use std::optional for bia_tensor and bia_tensor_info
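A minimal sketch of the resulting optional-bias pattern (member names follow this thread; the actual code in the ghstack PR may differ):

```cpp
#include <optional>

#include <arm_compute/core/TensorInfo.h>
#include <arm_compute/runtime/Tensor.h>

// Sketch only: the bias tensor and its metadata live in std::optional, so
// "has a bias" is encoded by the optional itself rather than a with_bias flag.
struct ACLDynamicQuantMatmulSketch {
  std::optional<arm_compute::TensorInfo> bia_tensor_info;
  std::optional<arm_compute::Tensor> bia_tensor;

  void release_bias_if_any() {
    if (bia_tensor.has_value()) {
      // Tell ACL we are done with the bias buffer; no separate bookkeeping.
      bia_tensor->allocator()->free();
      bia_tensor.reset();
      bia_tensor_info.reset();
    }
  }
};
```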
…tly. This enables a fast path for eager mode static quantization for AArch64 through Arm Compute Library (ACL) directly. PR pytorch#145942 addressed the high overhead in qlinear_dynamic on AArch64 (due to redundant weight pretranspositions and reductions) by enabling a path that calls ACL directly. This does the same thing but for (static) qlinear.
ACL is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set. Currently, it is only used indirectly in ATen via oneDNN for AArch64 targets. However, there are cases where it makes sense to utilize ACL directly without oneDNN as an intermediary - e.g. quantization. See pytorch#145942, pytorch#147337, pytorch#146620. This patch enables such use cases by exposing ACL to ATen.
This enables a fast path for eager mode dynamic quantization for AArch64 through Arm Compute Library (ACL) directly. Context: PR pytorch#126687 enabled an optimized implementation for qlinear_dynamic for aarch64 through ideep → oneDNN → ACL, which improved performance by ~10x compared to the previous implementation. However, the current qlinear_dynamic path (ideep → oneDNN → ACL) suffers from high overhead due to the API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (lowp_gemm) API - for example, ACL's lowp_gemm objects cache information like weights reduction or weights in an optimized memory format, which oneDNN does not allow due to its stateless nature. Hence, ACL currently runs a (redundant) sum of columns and pre-transposition (to the gemm kernel's optimal format) for each GEMM operation. This PR addresses the sub-optimalities above by integrating ACL directly with qlinear_dynamic. This approach yields an average speedup (averaged over context_lengths of 2^3 up to 2^9) of ~50% for bert-base-uncased, bert-large-uncased, roberta-base, distilbert-base-uncased with 16 threads on a Neoverse-V1 (with transformers==4.48). To achieve this we introduce PackedLinearWeightsACL (as a subclass of PackedLinearWeightsOnednn) with an implementation of qlinear_dynamic that uses ACL directly, while qlinear still follows the oneDNN path.
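To make the integration approach concrete, here is a hedged skeleton of what such a subclass could look like; the base class is stubbed out for illustration, and the members and signatures are inferred from this conversation rather than copied from the final code:

```cpp
#include <memory>

#include <ATen/ATen.h>

// Stub of the existing oneDNN-backed packed-weights class; the real class in
// ATen's quantized CPU backend carries packing state and more entry points.
struct PackedLinearWeightsOnednn {
  virtual ~PackedLinearWeightsOnednn() = default;
  virtual at::Tensor apply_dynamic(at::Tensor input, bool reduce_range = false) = 0;
  virtual at::Tensor apply_dynamic_relu(at::Tensor input, bool reduce_range = false) = 0;
};

// The ACL-side GEMM wrapper discussed later in this conversation.
struct ACLDynamicQuantMatmul;

// The subclass overrides only the dynamic-quantization entry points so they
// call Arm Compute Library directly; static qlinear keeps using the
// inherited oneDNN implementation.
struct PackedLinearWeightsACL : public PackedLinearWeightsOnednn {
  at::Tensor apply_dynamic(at::Tensor input, bool reduce_range = false) override;
  at::Tensor apply_dynamic_relu(at::Tensor input, bool reduce_range = false) override;

 private:
  // Cached stateful ACL lowp_gemm object, so weight pretransposition and the
  // column-sum reduction happen once rather than on every GEMM call.
  std::shared_ptr<ACLDynamicQuantMatmul> acl_gemm_;
};
```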
1542c78 to ac5618b (compare)
I created a standalone PR #148542 for all cmake-related changes and removed them from here to ease the review process.
malfet left a comment
I don't know much about this particular codepath, but requesting changes solely for the integration strategy. (Speaking of strategy, it would be good to have an RFC issue outlining ACL/oneDNN integration - i.e. what is the end goal: fully decouple ACL from oneDNN, keep some direct usage until oneDNN integration is done, or is it about something else?)
So back to integration:
- Please move the logic that searches for ACL into a separate PR (you have write permissions, so you can use ghstack, can't you?) and use modern cmake (that defines a target rather than global variables) to introduce the new dependency
- Avoid explicit memory management (i.e. if something needs to free the memory, wrap it into a simple unique_ptr; a sketch of that pattern follows this comment)
- Avoid implementing methods in headers unless those are inline methods or templates
- Also, as much as possible please use the Torch memory allocator, rather than mix ACL and Torch ones, as it will make memory tracking/reporting easier
Last but not least: you've added the script that benchmarks the perf, but did not share the numbers before and after; that would help one understand the benefits this PR brings.
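As an illustration of the RAII suggestion in the list above, one hypothetical way to avoid hand-written free() calls is a unique_ptr with a custom deleter (a sketch under the assumption that the ACL tensor is heap-allocated; this is not the code in the PR):

```cpp
#include <memory>

#include <arm_compute/core/TensorInfo.h>
#include <arm_compute/runtime/Tensor.h>

// The deleter releases the ACL-side allocation before destroying the tensor
// object, so no destructor has to call free() by hand.
struct AclTensorDeleter {
  void operator()(arm_compute::Tensor* t) const {
    if (t != nullptr) {
      t->allocator()->free();
      delete t;
    }
  }
};

using AclTensorPtr = std::unique_ptr<arm_compute::Tensor, AclTensorDeleter>;

AclTensorPtr make_acl_tensor(const arm_compute::TensorInfo& info) {
  AclTensorPtr t(new arm_compute::Tensor());
  t->allocator()->init(info);
  t->allocator()->allocate();
  return t;  // ownership (and cleanup) travels with the unique_ptr
}
```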
int64_t k_;
int64_t n_;
int64_t wei_zero_point_;
Is there a document outlining the naming convention? Underscore usually means private variable, but they are public, as this is a struct.
Good catch, I meant to actually make them private.
I addressed this in the new ghstack PR, please see this line
# FindACL
# ----------
#
# Finds the Arm Compute Library
Does this file exist somewhere else? If so, please reference where it was copied from.
If this was created exclusively for PyTorch, please use modern CMake, i.e. instead of (or in addition to) defining global variables, add libraries/targets, something like:
add_library(ArmComputeLib INTERFACE)
target_link_libraries(ArmComputeLib INTERFACE ${ACL_LIBRARIES})
target_include_directories(ArmComputeLib INTERFACE ${ACL_INCLUDE_DIRS})
This file was not created exclusively for PyTorch; it was copied from oneDNN: https://github.com/oneapi-src/oneDNN/blob/main/cmake/FindACL.cmake
I referenced that in the new ghstack PR - see this line
~ACLDynamicQuantMatmul() {
  // this will free memory allocated for the quantized src tensor since the
  // allocation happened through ACL: src_s8_tensor.allocator()->allocate()
  src_s8_tensor.allocator()->free();
This is a very unsafe programming model: there is no default constructor, so this structure could be allocated uninitialized, and then freed. I have not tried it, but it's very likely that if someone writes something like
{
ACLDynamicQuantMatmul v;
}
it will crash in the destructor, as wei_tensor has not been allocated, but its allocator()->free() method is called unconditionally.
Actually, sorry for the confusion, my comment above is not right.
tensor.allocator()->free() never frees any memory (whether that memory was allocated by ACL or not). It just tells ACL that we're no longer using the pointer - See here - this can't lead to crashes.
The memory allocated by ACL is freed automatically.
I agree the structure here is not nice.
I added constructors and made sure all memory allocations happen through PyTorch - See here
@@ -0,0 +1,257 @@
#pragma once
It would be good to add some sort of a comment explaining what ACL is (in computing, this acronym is most commonly associated with access control lists, see https://en.wikipedia.org/wiki/ACL) and what the classes/functions defined in this header are supposed to do.
Good idea, done in the ghstack PR here
at::Tensor apply_dynamic_relu(at::Tensor input, bool reduce_range = false)
    override;

std::shared_ptr<ACLDynamicQuantMatmul> get_acl_dynamic_quant_matmul(
Just curious, what is the thought process here of having the implementation in the header vs the respective CPP file?
My initial PoC was simple enough that it did not require a cpp file for implementation.
I agree that the implementation has gotten complex enough that it needs a cpp file.
I addressed this in the new ghstack PR, please see here
std::shared_ptr<ACLDynamicQuantMatmul> get_acl_dynamic_quant_matmul(
    const ACLDynamicQuantMatmulCacheKey& key) {
  // We're only maintaining a 2 element LRU cache
Sorry for the nitpicks. A few thoughts:
- The LRU cache idea does not seem to be unique to ACL, so an implementation should exist someplace else. If not, please add the implementation to, say, c10/utils/lru_cache.h and then use it here
- Can the variable name be shorter here? (just cache or quant_cache?)
- Again, naming convention: as the variable is private, shouldn't its name end with _?
LRU cache is not unique to ACL indeed. I could not find an implementation to use and given that our LRU cache impl just keeps track of two elements only, I don't see the point of making it global to PyTorch outside ACLUtils.h
If we end up implementing a (real) more complex LRU cache in the future, we'll add it to c10/utils/lru_cache.h and use it from there.
I agree with your comments about the name, it is now cache_ in the new ghstack PR, please see this line
std::rotate(
    acl_dynamic_quant_cache.begin(),
    acl_dynamic_quant_cache.begin() + 1,
    acl_dynamic_quant_cache.end());
If your cache size is two, wouldn't std::swap(cache[0], cache[1]) be an equivalent?
Good idea, thank you!
I addressed this in the new ghstack PR here
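For context, a minimal sketch of the two-element most-recently-used cache being discussed, using std::swap for promotion as suggested (key and class names here are illustrative; the real implementation lives in the ghstack PR):

```cpp
#include <array>
#include <cstdint>
#include <memory>
#include <tuple>
#include <utility>

struct ACLDynamicQuantMatmul;  // the cached GEMM wrapper (declaration only)

// Illustrative key; the real one encodes the GEMM problem and thread count.
using CacheKey = std::tuple<int64_t, int64_t, int64_t, int64_t>;
using CacheEntry = std::pair<CacheKey, std::shared_ptr<ACLDynamicQuantMatmul>>;

class TwoElementLruCache {
 public:
  std::shared_ptr<ACLDynamicQuantMatmul> find(const CacheKey& key) {
    // Slot 0 always holds the most recently used entry.
    if (cache_[0].second && cache_[0].first == key) {
      return cache_[0].second;
    }
    if (cache_[1].second && cache_[1].first == key) {
      // Promote the hit; with only two slots, std::swap is equivalent to
      // std::rotate over the whole container.
      std::swap(cache_[0], cache_[1]);
      return cache_[0].second;
    }
    return nullptr;
  }

  void insert(CacheKey key, std::shared_ptr<ACLDynamicQuantMatmul> gemm) {
    // Overwrite the least recently used slot, then promote it to the front.
    cache_[1] = {std::move(key), std::move(gemm)};
    std::swap(cache_[0], cache_[1]);
  }

 private:
  std::array<CacheEntry, 2> cache_;
};
```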
    &acl_gemm->dst_tensor_info,
    &acl_gemm->dst_tensor_info,
    acl_gemm->acl_relu_info);
if (relu_status.error_code() != arm_compute::ErrorCode::OK) {
Don't you want to add TORCH_WARN or something, so that users know something went wrong?
Great Idea, done here
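A sketch of what the warning could look like when ACL's validation rejects the fused ReLU (the member names mirror the snippet above; the exact message and fallback behaviour in the PR may differ):

```cpp
#include <arm_compute/runtime/NEON/functions/NEActivationLayer.h>
#include <c10/util/Exception.h>

// acl_gemm is assumed to expose dst_tensor_info (arm_compute::TensorInfo) and
// acl_relu_info (arm_compute::ActivationLayerInfo), as in the snippet above.
template <typename AclGemm>
bool validate_fused_relu(const AclGemm& acl_gemm) {
  auto relu_status = arm_compute::NEActivationLayer::validate(
      &acl_gemm.dst_tensor_info,
      &acl_gemm.dst_tensor_info,
      acl_gemm.acl_relu_info);
  if (relu_status.error_code() != arm_compute::ErrorCode::OK) {
    TORCH_WARN(
        "Arm Compute Library could not fuse ReLU into the dynamically "
        "quantized matmul; falling back. Reason: ",
        relu_status.error_description());
    return false;  // caller falls back to the oneDNN path
  }
  return true;
}
```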
// validate that ACL can handle the given problem and inputs.
if (fuse_relu) {
  arm_compute::Status relu_status =
Why not use auto there?
Suggested change:
- arm_compute::Status relu_status =
+ auto relu_status =
Done here
// allocate memory only for the quantized tensor, the rest will use memory
// already available from PyTorch
acl_gemm->src_s8_tensor.allocator()->allocate();
Just curious, why use ACL allocator instead of an existing PyTorch one? So that memory tracking story is cleaner, i.e. all memory is allocated/tracked and freed by PyTorch caching allocator?
I agree with you.
I now use PyTorch for all allocations - ACL just imports pointers but does not explicitly allocate/deallocate any memory.
See here
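For illustration, the pattern being described - PyTorch owns every buffer and ACL only wraps the pointer via import_memory() - might look roughly like this (a sketch; quantization metadata and the exact tensor shapes used in the PR are omitted):

```cpp
#include <ATen/ATen.h>
#include <arm_compute/core/TensorInfo.h>
#include <arm_compute/runtime/Tensor.h>

// Point an ACL tensor at a buffer that PyTorch allocated and owns.
// ACL does not free imported memory, so the lifetime stays with the at::Tensor.
void import_s8_src(arm_compute::Tensor& src_s8_tensor, const at::Tensor& src_s8) {
  const int64_t m = src_s8.size(0);
  const int64_t k = src_s8.size(1);
  src_s8_tensor.allocator()->init(arm_compute::TensorInfo(
      arm_compute::TensorShape(k, m),  // ACL shapes list the fastest dim first
      1,
      arm_compute::DataType::QASYMM8_SIGNED));
  // import_memory() makes ACL use the externally owned buffer without copying.
  src_s8_tensor.allocator()->import_memory(src_s8.data_ptr());
}

// Usage sketch: the buffer itself comes from the PyTorch allocator.
//   at::Tensor src = at::empty({m, k}, at::device(at::kCPU).dtype(at::kChar));
//   arm_compute::Tensor acl_src;
//   import_s8_src(acl_src, src);
```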
Please check the PR description above, where I say:
ACL is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set. Currently, it is only used indirectly in ATen via oneDNN for AArch64 targets. However, there are cases where it makes sense to utilize ACL directly without oneDNN as an intermediary - e.g. quantization. See #145942, #147337, #146620. This patch enables such use cases by exposing ACL to ATen.
ghstack-source-id: 266c621
Pull Request resolved: #148581
…tly. This enables a fast path for eager mode static quantization for AArch64 through Arm Compute Library (ACL) directly. PR #145942 addressed the high overhead in qlinear_dynamic on AArch64 (due to redundant weight pretranspositions and reductions) by enabling a path that calls ACL directly. This does the same thing but for (static) qlinear.
ghstack-source-id: 05435a0
Pull Request resolved: #148586
This enables qint8 and quint8 add for AArch64 through Arm Compute Library (ACL) directly. It's based on changes in PR pytorch#145942 which enable the use of ACL directly in ATen. Relative performance improvement using OMP_NUM_THREADS=1 is ~15x; using OMP_NUM_THREADS=32 it's ~5.4x.
I created ghstack PRs for this:
I addressed reviews on this PR in #148584 - its equivalent ghstack one. Unfortunately, I couldn't convert these PRs to be ghstack PRs because they're on feature branches belonging to my PyTorch fork, hence the new PRs.
The plan for now is to enable this fast path until a direct fast path to ACL from oneDNN is implemented.
ACL is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set. Currently, it is only used indirectly in ATen via oneDNN for AArch64 targets. However, there are cases where it makes sense to utilize ACL directly without oneDNN as an intermediary - e.g. quantization. See #145942, #147337, #146620. This patch enables such use cases by exposing ACL to ATen.
Pull Request resolved: #148584
Approved by: https://github.com/malfet
Closing in favor of ghstack PR #148585, which has all comments addressed.
This enables a fast path for eager mode dynamic quantization for AArch64 through Arm Compute Library (ACL) directly.

Context: PR #126687 enabled an optimized implementation for qlinear_dynamic for AArch64 through ideep → oneDNN → ACL, which improved performance by ~10x compared to the previous implementation.

However, the current qlinear_dynamic path (ideep → oneDNN → ACL) suffers from high overhead due to the API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (lowp_gemm) API - for example, ACL's lowp_gemm objects cache information like weights reduction or weights in an optimized memory format, which oneDNN does not allow due to its stateless nature. Hence, ACL currently runs a (redundant) sum of columns and pre-transposition (to the gemm kernel's optimal format) for each GEMM operation.

This PR addresses the sub-optimalities above by integrating ACL directly with qlinear_dynamic. This approach yields an average speedup (averaged over context_lengths of 2^3 up to 2^9) of ~50% for bert-base-uncased, bert-large-uncased, roberta-base, distilbert-base-uncased with 16 threads on a Neoverse-V1 (with transformers==4.48) - see benchmark code below. To achieve this, we:
- use ACL directly from ATen; it is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set.
- introduce PackedLinearWeightsACL (as a subclass of PackedLinearWeightsOnednn) with an implementation of qlinear_dynamic that uses ACL directly, while qlinear still follows the oneDNN path. This serves as a first step towards doing the same for (static) qlinear and will allow us to remove the dependence on PackedLinearWeightsOnednn.

The following code was used to benchmark qlinear_dynamic performance:

Fixes #ISSUE_NUMBER
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @malfet @snadampal @milpuz01