[ARM CPU] Enable FP16 kernels for GQA op #23746
Merged
fajin-corp merged 11 commits into main on Feb 20, 2025
Conversation
…icates, 3> cover lda/ldb/ldc in UT
amarin16 approved these changes on Feb 20, 2025
guschmue pushed a commit that referenced this pull request on Mar 6, 2025
### Description

- Enable hgemm and softmax fp16 kernels for GQA.
- Add intra-loop parallelism to the RoPE fp16 kernel.

__Benchmarking models__

- float32: [phi-3 cpu accuracy level 0](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32)
- float16: [phi-3 gpu accuracy level 0](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/tree/main/cuda/cuda-int4-rtn-block-32)

Note:

- Both the fp32 and fp16 models share the same model structure and operator settings.
- GQA takes ~15% of the runtime.
- Prompt length 256, token generation length 512.

Linux (Ubuntu 24.04), Standard D16pls v5 (16 vCPUs, 32 GiB memory):

| | fp32 (tps) | old fp16 (tps) | new fp16 (tps) | new fp16 vs old fp16 | new fp16 vs fp32 |
|--|--|--|--|--|--|
| prompt processing | 31.22 | 44.24 | 46.29 | +4.6% | +48.25% |
| token generation | 4.75 | 7.2 | 7.95 | +10.39% | +67.43% |

### Motivation and Context

Speed up GQA on FP16.
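The PR itself does not include kernel code in this thread, so the following is only a minimal, self-contained C++ sketch of the "intra-loop parallelism" idea applied to a RoPE-style rotation. The names (`RopeRow`, `RopeParallel`), the interleaved pair layout, the `10000` frequency base, and the plain `std::thread` partitioning are all illustrative assumptions, not the MLAS implementation in this PR; `float` stands in for the fp16 type the real ARM kernel operates on.

```cpp
// Sketch: splitting the row loop of a single block across threads
// ("intra-loop" parallelism), rather than parallelizing only across
// outer units such as heads. All names here are hypothetical.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Rotate one row of length `dim` (dim even), treating it as
// interleaved (x, y) pairs, as in a typical RoPE formulation.
static void RopeRow(float* row, std::size_t dim, std::size_t pos) {
  for (std::size_t i = 0; i + 1 < dim; i += 2) {
    const float theta =
        pos * std::pow(10000.0f, -static_cast<float>(i) / dim);
    const float c = std::cos(theta), s = std::sin(theta);
    const float x = row[i], y = row[i + 1];
    row[i] = x * c - y * s;
    row[i + 1] = x * s + y * c;
  }
}

// Partition `rows` of one (sequence x head_dim) block across threads.
void RopeParallel(float* data, std::size_t rows, std::size_t dim,
                  std::size_t pos0, unsigned num_threads) {
  if (num_threads == 0) num_threads = 1;  // hardware_concurrency may be 0
  std::vector<std::thread> workers;
  const std::size_t chunk = (rows + num_threads - 1) / num_threads;
  for (unsigned t = 0; t < num_threads; ++t) {
    const std::size_t begin = t * chunk;
    const std::size_t end = std::min(rows, begin + chunk);
    if (begin >= end) break;
    workers.emplace_back([=] {
      for (std::size_t r = begin; r < end; ++r)
        RopeRow(data + r * dim, dim, pos0 + r);
    });
  }
  for (auto& w : workers) w.join();
}

int main() {
  const std::size_t rows = 8, dim = 64;
  std::vector<float> q(rows * dim, 1.0f);
  RopeParallel(q.data(), rows, dim, /*pos0=*/0,
               std::thread::hardware_concurrency());
  return 0;
}
```

One plausible reading of the benchmark split (a larger relative gain on token generation than on prompt processing) is that partitioning inside a single call's loop helps most when each call processes few tokens, which is exactly the token-generation case; the PR description does not state this explicitly.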
ashrit-ms pushed a commit that referenced this pull request on Mar 17, 2025