[webgpu] Use components for VxAttentionScore #23726
Conversation
For phi3.5-gqa-static sum_long (>1000 tokens) on Meteor Lake.
Before:
300 tokens in 27.0 sec, e2e: 11.1 tps, prompt: 212.4 tps, gen: 14.2 tps, ttft: 5.85 sec
After:
300 tokens in 23.0 sec, e2e: 13.0 tps, prompt: 248.9 tps, gen: 16.6 tps, ttft: 4.99 sec
@sushraja-msft @guschmue This PR applies the flash attention logic so that the softmax can be merged into the last-stage shader; it's used to optimize the generation shader. Since https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/webgpu/bert/flash_attention.cc#L413 hasn't been applied to GQA yet, this PR also benefits the prefill time. As a next step, I can move/reuse CopyKVCache so that the generation shader can also combine QKT into one shader.
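For context, the flash attention trick referenced above is the "online softmax": by carrying a running max and running normalizer, the softmax normalization can be folded into the final weighted-sum stage instead of needing its own full-pass shader. The following is a minimal Python sketch of that idea only; the function and variable names are illustrative and do not mirror the actual onnxruntime WebGPU shader code.

```python
import math

def online_softmax_weighted_sum(scores, values):
    """Accumulate sum(softmax(scores)[i] * values[i]) in one streaming pass.

    Keeps a running max (for numerical stability) and a running
    normalizer, rescaling the partial accumulator whenever the max
    grows, so the final division is the only separate step.
    """
    running_max = -math.inf
    normalizer = 0.0
    acc = 0.0
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        scale = math.exp(running_max - new_max)  # rescale old contributions
        weight = math.exp(s - new_max)           # weight of the new element
        normalizer = normalizer * scale + weight
        acc = acc * scale + weight * v
        running_max = new_max
    # The normalization happens here, in the "last stage".
    return acc / normalizer

scores = [0.5, 2.0, -1.0, 3.0]
values = [1.0, 2.0, 3.0, 4.0]
print(online_softmax_weighted_sum(scores, values))
```

This matches a conventional two-pass softmax-then-dot-product to floating-point precision, which is why the merge does not change the attention output.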
@sushraja-msft @guschmue, I have restored this PR to only add the components support and will put the flash attention support in a separate PR. Please take a look, thanks.
@guschmue It seems that the 2 failing CIs are unrelated to my changes. Can this PR be merged?
yeap |