KleidiAI SGEMM/IGEMM/Quantized MatMul - Modular MLAS API Changes for KleidiAI #25187
Conversation
@microsoft-github-policy-service agree [company="{Arm}"]

@microsoft-github-policy-service agree company="Arm"
- Adding SGEMM/IGEMM for KleidiAI under new architecture
- Fixed offset lambda function causing release compile issue
- Added HWC changes to KAI IGEMM
- Integrated HWC transpose
Signed-off-by: Damien Dooley <[email protected]>
Force-pushed from 0b628b9 to ac8b673.
Force-pushed from ac8b673 to baa63df.
… 2. bench_sgemm.cpp MlasGemmPackB func sig mismatch
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).
…_Tests.MatMulIntegerToFloat
Hi @edgchen1, @hariharans29, we will include all feedback in a subsequent PR. Do you think you can rerun the failing test? (I feel it's an internal issue.) Can you also approve and merge the PR to ensure it makes the release? Thanks very much.
…ded but not constant. (#25544)

### Description

In the DynamicQuantizeMatMul KleidiAI-specific prepacking logic, handle the case where the B zero point input is provided but not constant. In this case, we should not prepack. Add unit tests that exercise the prepacking code path. Add a check for Arm SME instructions in DynamicQuantizeMatMul before calling `MlasDynamicQGemmBatch()` and associated functions.

### Motivation and Context

Follow-up to #25187.
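The prepacking guard described above can be sketched as follows. This is a minimal illustration, not ORT's actual code: `ShouldPrePackB` and the struct fields are hypothetical names, assuming the rule "prepack only when the B zero point is absent or a constant initializer, and SME is available".

```cpp
#include <cassert>

// Hypothetical sketch of the prepacking decision for the KleidiAI path in
// DynamicQuantizeMatMul. Names are illustrative, not ONNX Runtime's API.
struct OptionalInput {
    bool present;      // is the optional B zero point input supplied?
    bool is_constant;  // is it a constant initializer (known at prepack time)?
};

bool ShouldPrePackB(const OptionalInput& b_zero_point, bool sme_available) {
    if (!sme_available) {
        return false;  // KleidiAI dynamic-quantized kernels require SME
    }
    if (b_zero_point.present && !b_zero_point.is_constant) {
        return false;  // zero point only known at run time: cannot prepack
    }
    return true;
}
```

The key case is the middle one: a zero point that exists but is not constant cannot be folded into a packed B, so the kernel must take the non-prepacked path.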
…KleidiAI (microsoft#25187)

This PR introduces the initial integration of KleidiAI-optimized microkernels into ONNX Runtime's MLAS backend, focusing on support for:

- SGEMM
- IGEMM
- Dynamic quantized MatMuls

Key changes:

- Implements overrides for MlasGemmBatch, MlasGemmPackBSize, and MlasGemmPackB using KleidiAI where applicable.
- Applies dispatch logic based on TransA == CblasNoTrans and SME2 availability.
- Supports float32 and int8 GEMM workloads with conditionally invoked SME2 paths.
- Maintains fallback paths to the default MLAS implementations to ensure coverage and stability.

**Known issues / next steps:**

Requesting feedback specifically on the API structure:

- Does the new MLAS interface design align with long-term extensibility?
- Are the dispatch points and override boundaries well-structured?

**Indicative performance figures:**

The added kernels are particularly effective for Conv2D operators, based on KleidiAI SME running mobilenet_v1_ssd_f32 on a Mac Mini M4 on a single thread.

[Figure: single-thread performance comparison for mobilenet_v1_ssd_f32 on Mac Mini M4]

---------

Signed-off-by: Damien Dooley <[email protected]>
Co-authored-by: Jonathan Clohessy <[email protected]>
Co-authored-by: Declan Flavin <[email protected]>
Co-authored-by: Colm Donelan <[email protected]>
Co-authored-by: Damien Dooley <[email protected]>
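The dispatch rule described in the key changes (route to KleidiAI only when TransA == CblasNoTrans and SME2 is available, otherwise fall back to default MLAS) can be sketched as below. This is an illustration of the rule, not the actual MLAS override code; `SelectSgemmPath` and `GemmPath` are hypothetical names.

```cpp
#include <cassert>

// Illustrative sketch of the SGEMM dispatch rule, assuming the conditions
// stated in the PR description. Not the real MLAS internals.
enum CBLAS_TRANSPOSE { CblasNoTrans, CblasTrans };

enum class GemmPath { KleidiAI, DefaultMlas };

GemmPath SelectSgemmPath(CBLAS_TRANSPOSE trans_a, bool sme2_available) {
    if (trans_a == CblasNoTrans && sme2_available) {
        return GemmPath::KleidiAI;  // conditionally invoked SME2 microkernel
    }
    return GemmPath::DefaultMlas;   // fallback for coverage and stability
}
```

Keeping the fallback unconditional means every input shape and transpose combination still has a correct path even when the optimized kernel's preconditions are not met.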
The change is already in the 1.23.0 release branch. Therefore I removed the tag.
**Key changes**

This PR integrates KleidiAI SME1 FP32 kernels into the existing kleidiai_sgemm.cpp implementation.

- Adds an SME2 flag in onnxruntime/core/common/cpuid_info.h and onnxruntime/core/common/cpuid_info.cc. Previously integrated SME2 kernels were gated on the SME(1) check; this change correctly distinguishes when SME1 and when SME2 kernels are to be used.
- Bumps the KleidiAI version to 1.10.0.

**Indicative performance data**

Single-thread runs on a Mac Mini M4 across various models using:

`onnxruntime_perf_test -v -e cpu -I -m times -x 1 -y 1 -r 1`

[Figure: single-thread performance across models on Mac Mini M4]

**Next steps**

Additional commits will address outstanding to-do items from the previous PR: [KleidiAI SGEMM/IGEMM/Quantized MatMul - Modular MLAS API Changes for KleidiAI #25187](#25187)

Signed-off-by: Patryk Kaiser <[email protected]>
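The SME1/SME2 gating fix described above can be sketched as follows. The struct fields and function are hypothetical names (the real flags live in cpuid_info.h/.cc); the point is that SME2 kernels are selected only on an SME2 check, while baseline SME still enables the SME1 FP32 kernels.

```cpp
#include <cassert>

// Hypothetical sketch of kernel selection after the SME2 flag was added.
// Field and function names are illustrative, not the actual cpuid_info API.
struct CpuFeatures {
    bool has_arm_sme;   // baseline SME (SME1)
    bool has_arm_sme2;  // SME2 extension
};

enum class Fp32Kernel { Sme2, Sme1, DefaultMlas };

Fp32Kernel SelectFp32Kernel(const CpuFeatures& f) {
    if (f.has_arm_sme2) return Fp32Kernel::Sme2;  // SME2 kernels need SME2
    if (f.has_arm_sme)  return Fp32Kernel::Sme1;  // SME1 FP32 kernels (this PR)
    return Fp32Kernel::DefaultMlas;               // no SME: default MLAS path
}
```

Before this change, gating SME2 kernels on the SME(1) check would incorrectly select them on SME1-only hardware; the separate flag removes that ambiguity.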
### Key changes

This patch contains logging macros for the KleidiAI kernels. It also contains changes and to-dos from a previous PR: #25187

---------

Signed-off-by: Orlaith Monahan <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
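A kernel-specific logging macro of the kind this patch describes might look like the sketch below. The macro and helper names are assumptions for illustration, not the ones the patch actually adds.

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Hypothetical sketch of a logging macro for the KleidiAI kernels.
// A uniform "[kleidiai]" prefix keeps kernel logs easy to grep.
inline std::string KleidiAiFormatLog(const std::string& fn,
                                     const std::string& msg) {
    return "[kleidiai] " + fn + ": " + msg;
}

// Logs to stderr with the calling function's name prepended.
#define KLEIDIAI_KERNEL_LOG(msg) \
    (std::cerr << KleidiAiFormatLog(__func__, (msg)) << '\n')
```

Routing the formatting through a plain function keeps the macro body small and makes the output format unit-testable without capturing stderr.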