@damdoo01-arm
Contributor

damdoo01-arm commented Jun 26, 2025

This PR introduces the initial integration of KleidiAI-optimized microkernels into ONNX Runtime's MLAS backend, focusing on support for:

  • SGEMM
  • IGEMM
  • Dynamic Quantized MatMuls

Key changes:

  • Implements overrides for MlasGemmBatch, MlasGemmPackBSize, and MlasGemmPackB using KleidiAI where applicable.
  • Applies dispatch logic based on TransA == CblasNoTrans and SME2 availability.
  • Supports float32 and int8 GEMM workloads with conditionally invoked SME2 paths.
  • Maintains fallback paths to default MLAS implementations to ensure coverage and stability.
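
The dispatch rule described in the key changes can be sketched roughly as follows. This is an illustration only: `Transpose`, `GemmPath`, `CpuFeatures`, and `SelectGemmPath` are hypothetical names, not the MLAS or KleidiAI API, and the actual overrides may check further conditions before taking the KleidiAI path.

```cpp
// Hypothetical stand-ins for illustration; not the actual MLAS/KleidiAI types.
enum class Transpose { NoTrans, Trans };
enum class GemmPath { KleidiAI, DefaultMlas };

struct CpuFeatures {
    bool has_sme2;  // assumed to be filled in by a runtime CPU feature probe
};

// Take the KleidiAI override only when A is not transposed and SME2 is
// available; every other case falls back to the default MLAS path.
constexpr GemmPath SelectGemmPath(CpuFeatures cpu, Transpose trans_a) {
    return (trans_a == Transpose::NoTrans && cpu.has_sme2)
               ? GemmPath::KleidiAI
               : GemmPath::DefaultMlas;
}

// Compile-time usage check: SME2 with a non-transposed A selects KleidiAI.
static_assert(SelectGemmPath(CpuFeatures{true}, Transpose::NoTrans) ==
                  GemmPath::KleidiAI,
              "SME2 with non-transposed A should take the KleidiAI path");
```

Keeping the decision in one small predicate means a single fallback point covers every unsupported case, which is the stability property the last bullet asks for.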

Known Issues / Next Steps:
Requesting feedback specifically on the API structure:

  • Does the new MLAS interface design align with long-term extensibility?
  • Are the dispatch points and override boundaries well-structured?

Indicative Performance figures:
The kernels added are particularly effective for Conv2D operators:

  • Based on KleidiAI SME running mobilenet_v1_ssd_f32 on Mac Mini M4 on a single thread
[image: https://github.com/user-attachments/assets/e39a7fef-1370-4332-83a3-1f3a80b29da4]

@damdoo01-arm
Contributor Author

@microsoft-github-policy-service agree [company="{Arm}"]

@damdoo01-arm
Contributor Author

@microsoft-github-policy-service agree company="Arm"

damdoo01-arm marked this pull request as draft June 26, 2025 16:26
damdoo01-arm force-pushed the kai_sgemm_igemm_quant_gemv branch from 0b628b9 to ac8b673 July 4, 2025 18:19
damdoo01-arm marked this pull request as ready for review July 4, 2025 18:20
damdoo01-arm marked this pull request as draft July 4, 2025 18:24
damdoo01-arm force-pushed the kai_sgemm_igemm_quant_gemv branch from ac8b673 to baa63df July 4, 2025 18:34
damdoo01-arm marked this pull request as ready for review July 4, 2025 21:11
@jywu-msft
Member

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@hariharans29
Member

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@hariharans29
Member

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@damdoo01-arm
Contributor Author

Hi @edgchen1, @hariharans29, we will address all feedback in a subsequent PR. Could you rerun the failing test? (I believe the failure is internal.) Could you also approve and merge the PR so that it makes the release?

Thanks very much,
Damien

hariharans29 merged commit cd450d1 into microsoft:main Jul 25, 2025
87 of 90 checks passed
jywu-msft pushed a commit that referenced this pull request Jul 26, 2025
…ded but not constant. (#25544)

### Description

In the KleidiAI-specific prepacking logic of DynamicQuantizeMatMul, handle the
case where the B zero point input is provided but not constant. In this case, we
should not prepack.

Add unit tests that exercise the prepacking code path.

Add a check for ARM SME instructions in DynamicQuantizeMatMul before
calling `MlasDynamicQGemmBatch()` and associated functions.
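
A minimal sketch of the prepacking guard described above, under stated assumptions: `BZeroPointInput` and `ShouldPrepackB` are hypothetical names, combining the SME check and the zero point check into one predicate is an assumption made for brevity, and the real operator may apply additional conditions.

```cpp
// Hypothetical description of the optional B zero point input.
struct BZeroPointInput {
    bool provided;     // the input is present on the node
    bool is_constant;  // its value is fixed at prepack time
};

// Prepack B for the KleidiAI path only when the CPU reports SME support and
// the zero point cannot change between runs: either no zero point input is
// provided, or the provided one is constant.
constexpr bool ShouldPrepackB(bool cpu_has_sme, BZeroPointInput zp) {
    return cpu_has_sme && (!zp.provided || zp.is_constant);
}

// The case this change addresses: zero point provided but not constant.
static_assert(!ShouldPrepackB(true, BZeroPointInput{true, false}),
              "a non-constant B zero point must disable prepacking");
```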

### Motivation and Context

Follow up to #25187
snnn pushed a commit that referenced this pull request Jul 28, 2025
…ded but not constant. (#25544)

sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request Aug 11, 2025
…KleidiAI (microsoft#25187)

---------

Signed-off-by: Damien Dooley <[email protected]>
Co-authored-by: Jonathan Clohessy <[email protected]>
Co-authored-by: Declan Flavin <[email protected]>
Co-authored-by: Colm Donelan <[email protected]>
Co-authored-by: Damien Dooley <[email protected]>
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request Aug 11, 2025
…ded but not constant. (microsoft#25544)

@snnn
Contributor

snnn commented Aug 28, 2025

The change is already in the 1.23.0 release branch. Therefore I removed the tag.

edgchen1 pushed a commit that referenced this pull request Sep 12, 2025
**Key changes**
This PR integrates KleidiAI SME1 FP32 kernels into the existing
kleidiai_sgemm.cpp implementation.

Adds an SME2 flag to onnxruntime/core/common/cpuid_info.h and
onnxruntime/core/common/cpuid_info.cc.
The previously integrated SME2 kernels were gated on the SME(1) check; this
change correctly distinguishes between when SME1 and SME2 kernels are to be
used.
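
The SME1/SME2 distinction can be sketched as a three-way selection. The names below (`ArmCpuFeatures`, `SgemmKernel`, `SelectSgemmKernel`) are hypothetical and do not mirror the actual cpuid_info or kleidiai_sgemm.cpp code; the sketch only assumes that two independent feature flags are available and that the default MLAS path remains the fallback.

```cpp
// Hypothetical feature flags; the real ones live in cpuid_info.h/.cc.
struct ArmCpuFeatures {
    bool has_sme;   // SME(1) reported by the CPU
    bool has_sme2;  // SME2 reported by the CPU
};

enum class SgemmKernel { Sme2, Sme1, DefaultMlas };

// Prefer SME2 kernels when SME2 is present, use SME1 kernels when only
// SME(1) is present, and otherwise stay on the default MLAS path.
constexpr SgemmKernel SelectSgemmKernel(ArmCpuFeatures cpu) {
    return cpu.has_sme2  ? SgemmKernel::Sme2
           : cpu.has_sme ? SgemmKernel::Sme1
                         : SgemmKernel::DefaultMlas;
}

// An SME(1)-only CPU must not be routed to SME2 kernels.
static_assert(SelectSgemmKernel(ArmCpuFeatures{true, false}) == SgemmKernel::Sme1,
              "SME1-only hardware should select the SME1 kernel");
```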

Bumping KleidiAI version to 1.10.0

**Indicative performance data**
Single-threaded runs on a Mac Mini M4 across various models, using:
`onnxruntime_perf_test -v -e cpu -I -m times -x 1 -y 1 -r 1`
<img width="785" height="400" alt="image"
src="https://github.com/user-attachments/assets/37c0b271-14fb-4b76-b2a0-28c5dd9308aa"
/>

**Next steps**
Additional commits to come will address outstanding to-do issues from
previous PR linked below:
[ KleidiAI SGEMM/IGEMM/Quantized MatMul - Modular MLAS API Changes for
KleidiAI #25187](#25187)

Signed-off-by: Patryk Kaiser <[email protected]>
hariharans29 pushed a commit that referenced this pull request Nov 6, 2025
### Key changes
This patch contains logging macros for the KleidiAI kernels.
It also contains changes and to-dos from a previous PR:
#25187

---------

Signed-off-by: Orlaith Monahan <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Rohanjames1997 pushed a commit to Rohanjames1997/onnxruntime that referenced this pull request Dec 4, 2025
…soft#26146)
