KleidiAI SGEMM/IGEMM/Quantized MatMul - Modular MLAS API Changes for KleidiAI #25187

damdoo01-arm · 2025-06-26T15:07:07Z

This PR introduces the initial integration of KleidiAI-optimized microkernels into ONNX Runtime's MLAS backend, focusing on support for:

- SGEMM
- IGEMM
- Dynamic Quantized MatMuls

Key changes:
- Implements overrides for MlasGemmBatch, MlasGemmPackBSize, and MlasGemmPackB using KleidiAI where applicable.
- Applies dispatch logic based on TransA == CblasNoTrans and SME2 availability.
- Supports float32 and int8 GEMM workloads with conditionally invoked SME2 paths.
- Maintains fallback paths to default MLAS implementations to ensure coverage and stability.
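The dispatch rule described above (use the KleidiAI override only when TransA is CblasNoTrans and SME2 is available, otherwise fall back to the default MLAS path) could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `TransposeKind`, `GemmPath`, `CpuFeatures`, `KleidiAiCanHandle`, and `SelectGemmPath` are hypothetical names invented for the example, and the real override signatures are not reproduced here.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; not the real MLAS types.
enum class TransposeKind { NoTrans, Trans };
enum class GemmPath { KleidiAI, DefaultMlas };

struct CpuFeatures {
    bool has_sme2;  // assumed to be filled in by a runtime CPU feature probe
};

// Returns true only when the (hypothetical) KleidiAI override can take the call.
bool KleidiAiCanHandle(TransposeKind trans_a, const CpuFeatures& cpu) {
    return trans_a == TransposeKind::NoTrans && cpu.has_sme2;
}

// Picks the path for one GEMM call: try the override first, else fall back.
GemmPath SelectGemmPath(TransposeKind trans_a, const CpuFeatures& cpu) {
    const GemmPath path = KleidiAiCanHandle(trans_a, cpu) ? GemmPath::KleidiAI
                                                          : GemmPath::DefaultMlas;
    // Invariant of this sketch: the KleidiAI path is never chosen without SME2.
    assert(path == GemmPath::DefaultMlas || cpu.has_sme2);
    return path;
}
```

Keeping the capability check in one predicate means every unsupported combination lands on the default MLAS path, which is the coverage property the fallback is meant to preserve.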

Known Issues / Next Steps:
Requesting feedback specifically on the API structure:
- Does the new MLAS interface design align with long-term extensibility?
- Are the dispatch points and override boundaries well-structured?

Indicative Performance figures:
The kernels added are particularly effective for Conv2D operators:

Based on KleidiAI SME running mobilenet_v1_ssd_f32 on Mac Mini M4 on a single thread

damdoo01-arm · 2025-06-26T15:57:52Z

@microsoft-github-policy-service agree [company="{Arm}"]

damdoo01-arm · 2025-06-26T15:58:23Z

@microsoft-github-policy-service agree company="Arm"

onnxruntime/core/mlas/lib/mlas_platform.cpp

jywu-msft · 2025-07-09T15:26:34Z

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-07-09T15:26:55Z

Azure Pipelines successfully started running 5 pipeline(s).

hariharans29 · 2025-07-24T21:40:17Z

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-07-24T21:40:39Z

Azure Pipelines successfully started running 5 pipeline(s).

hariharans29 · 2025-07-24T22:26:49Z

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-07-24T22:27:09Z

Azure Pipelines successfully started running 5 pipeline(s).

onnxruntime/contrib_ops/cpu/quantization/dynamic_quantize_matmul.cc

onnxruntime/core/mlas/lib/qgemm.cpp

onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp

onnxruntime/core/mlas/lib/mlasi.h

damdoo01-arm · 2025-07-25T16:09:54Z

Hi @edgchen1, @hariharans29, we will include all feedback in a subsequent PR. Could you rerun the failing test? (I believe the failure is internal.) Could you also approve and merge the PR so that it makes the release?

Thanks very much,
Damien

…ded but not constant. (#25544)

### Description

In DynamicQuantizeMatMul KleidiAI-specific prepacking logic, handle case where B zero point input is provided but not constant. In this case, we should not prepack.

Add some unit tests that test the prepacking code path.

Add check for ARM SME instructions in DynamicQuantizeMatMul before calling `MlasDynamicQGemmBatch()` and associated functions.

### Motivation and Context

Follow up to #25187
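The prepacking rule in that follow-up can be illustrated with a small predicate. This is only a sketch under stated assumptions: `BZeroPoint` and `ShouldPrepackB` are hypothetical names, the requirement that B itself be constant is an assumption (prepacking generally needs weights known ahead of time), and folding the SME check into the same predicate is a simplification of the check the commit message places before `MlasDynamicQGemmBatch()` and associated functions.

```cpp
#include <cassert>

// Hypothetical description of the optional B zero point input.
struct BZeroPoint {
    bool provided;     // the optional input is present
    bool is_constant;  // the input is a constant initializer
};

// Decides whether B may be prepacked for the KleidiAI path.
bool ShouldPrepackB(bool b_is_constant, const BZeroPoint& zp, bool cpu_has_sme) {
    // An input that is not provided cannot be a constant initializer.
    assert(zp.provided || !zp.is_constant);

    if (!cpu_has_sme) return false;    // KleidiAI path unavailable on this CPU
    if (!b_is_constant) return false;  // assumption: only constant B is prepacked
    // Case from the commit message: zero point provided but not constant.
    if (zp.provided && !zp.is_constant) return false;
    return true;
}
```

Under these assumptions, a missing zero point and a constant zero point both allow prepacking, while a zero point that only becomes known at run time forces the non-prepacked path.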

…KleidiAI (microsoft#25187)

This PR introduces the initial integration of KleidiAI-optimized microkernels into ONNX Runtime's MLAS backend, focusing on support for:

- SGEMM
- IGEMM
- Dynamic Quantized MatMuls

Key changes:
- Implements overrides for MlasGemmBatch, MlasGemmPackBSize, and MlasGemmPackB using KleidiAI where applicable.
- Applies dispatch logic based on TransA == CblasNoTrans and SME2 availability.
- Supports float32 and int8 GEMM workloads with conditionally invoked SME2 paths.
- Maintains fallback paths to default MLAS implementations to ensure coverage and stability.

**Known Issues / Next Steps:**
Requesting feedback specifically on the API structure:
- Does the new MLAS interface design align with long-term extensibility?
- Are the dispatch points and override boundaries well-structured?

Indicative Performance figures:
The kernels added are particularly effective for Conv2D operators:

* Based on KleidiAI SME running mobilenet_v1_ssd_f32 on Mac Mini M4 on a single thread

<img width="815" height="308" alt="image" src="https://github.com/user-attachments/assets/e39a7fef-1370-4332-83a3-1f3a80b29da4" />

---------

Signed-off-by: Damien Dooley <[email protected]>
Co-authored-by: Jonathan Clohessy <[email protected]>
Co-authored-by: Declan Flavin <[email protected]>
Co-authored-by: Colm Donelan <[email protected]>
Co-authored-by: Damien Dooley <[email protected]>

snnn · 2025-08-28T20:42:29Z

The change is already in the 1.23.0 release branch. Therefore I removed the tag.

**Key changes**

This PR integrates KleidiAI SME1 FP32 kernels into the existing kleidiai_sgemm.cpp implementation.

Adding SME2 flag in onnxruntime/core/common/cpuid_info.h & onnxruntime/core/common/cpuid_info.cc

Previous SME2 kernels integrated were using SME(1) check, this change will correctly distinguish between when SME1 and SME2 kernels are to be used.

Bumping KleidiAI version to 1.10.0

**Indicative performance data**

Single thread Mac Mini M4 runs on various models using: onnxruntime_perf_test -v -e cpu -I -m times -x 1 -y 1 -r 1

<img width="785" height="400" alt="image" src="https://github.com/user-attachments/assets/37c0b271-14fb-4b76-b2a0-28c5dd9308aa" />

**Next steps**

Additional commits to come will address outstanding to-do issues from previous PR linked below:

[ KleidiAI SGEMM/IGEMM/Quantized MatMul - Modular MLAS API Changes for KleidiAI #25187](#25187)

Signed-off-by: Patryk Kaiser <[email protected]>
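The SME1/SME2 distinction described in that commit message might look roughly like the selection below. It is a sketch only: `CpuInfo`, `SgemmKernel`, and `SelectSgemmKernel` are hypothetical names, and the actual accessors added to cpuid_info.h are not reproduced here. The sketch also assumes that a CPU reporting SME2 reports SME as well.

```cpp
#include <cassert>

// Hypothetical kernel families for illustration only.
enum class SgemmKernel { Sme2, Sme1, DefaultMlas };

// Hypothetical view of the CPU feature flags, one flag per extension.
struct CpuInfo {
    bool has_sme;   // SME(1) present
    bool has_sme2;  // SME2 present
};

// Selects the most capable kernel family the CPU supports.
SgemmKernel SelectSgemmKernel(const CpuInfo& cpu) {
    // Assumption of this sketch: SME2 is never reported without SME.
    assert(!cpu.has_sme2 || cpu.has_sme);

    if (cpu.has_sme2) return SgemmKernel::Sme2;  // SME2 kernels need the SME2 flag
    if (cpu.has_sme) return SgemmKernel::Sme1;   // SME-only CPUs get SME1 kernels
    return SgemmKernel::DefaultMlas;             // otherwise use the default path
}
```

Checking the SME2 flag separately is what keeps an SME-only CPU from being routed to SME2 kernels, which is the problem the commit message describes with the earlier SME(1) check.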

…soft#26146)

### Key changes

This patch contains logging macros for the KleidiAI kernels

It also contains changes / todos from a previous PR: microsoft#25187

---------

Signed-off-by: Orlaith Monahan <[email protected]>
Co-authored-by: Edward Chen <[email protected]>

damdoo01-arm marked this pull request as draft June 26, 2025 16:26

edgchen1 reviewed Jun 27, 2025

onnxruntime/core/mlas/lib/mlas_platform.cpp (outdated)

JonathanC-ARM and others added 7 commits July 4, 2025 17:17

CLNTFRAME-376: Add initial pipeline setup

b8d540d

Integrate initial KFI changes

d69d3f5

- Adding SGEMM/IGEMM for kleidiai under new architecture - Fixed offset lambda function causing release compile issue - Added HWC Changes to KAI IGEMM - Integrate HWC Transpose

updated build and test to have mac stages

9e56664

Sync with latest from old repo

5c03bcd

Added Dynamic-Quantized Matmuls and GEMV

48b09e3

Fixed copyright attribution

972eef5

Signed-off-by: Damien Dooley <[email protected]>

KFI-51 Requires target "kleidiai" error building ONNX RT on aarch64.

baa63df

damdoo01-arm force-pushed the kai_sgemm_igemm_quant_gemv branch from 0b628b9 to ac8b673 Compare July 4, 2025 18:19

damdoo01-arm marked this pull request as ready for review July 4, 2025 18:20

damdoo01-arm marked this pull request as draft July 4, 2025 18:24

damdoo01-arm force-pushed the kai_sgemm_igemm_quant_gemv branch from ac8b673 to baa63df Compare July 4, 2025 18:34

MLAS API updates, mlas test fixes and ORT test fixes

a4068c1

damdoo01-arm marked this pull request as ready for review July 4, 2025 21:11

damdoo01-arm added 10 commits July 5, 2025 22:50

Remove Arm CI internal directory inadvertently pushed previously

3b34766

Fix to iOS build

44199a5

2nd attempt to fix ios build by force disabling KAI

459acf8

Wrap preprocessor ifdefs around dedicated KAI lib

c675ccd

Lint fixes

8b8e6a0

Added Android/Linux CI build fixes plus fixed a layer parser fix

91008e9

Fix to 2 more CI failures. 1. kleidiai dir not visible in some builds…

f984e81

… 2. bench_sgemm.cpp MlasGemmPackB func sig mismatch

Remove badly named directory

25e9815

Renamed kleidiai dir in lower case

3b9fd9a

Merge branch 'main' into kai_sgemm_igemm_quant_gemv

fa558b3

QGemm call fixes that resolve the failing tests in CPU_U8S8_Precision…

db144a0

…_Tests.MatMulIntegerToFloat

damdoo01-arm added 2 commits July 24, 2025 22:58

Fixed unused variable error after guard include

d45c6bb

Removed global variable and fixed transA override (removed inadverten…

4753512

…t patch file add)