[webgpu] Use components for VxAttentionScore #23726

Merged
guschmue merged 10 commits into main from attention_opt
Feb 20, 2025
Conversation

@qjia7
Contributor

@qjia7 qjia7 commented Feb 17, 2025

For phi3.5-gqa-static sum_long(>1000 tokens) on meteor lake.

Before:
300 tokens in 27.0sec, e2e:11.1 tps, prompt: 212.4 tps, gen: 14.2 tps, ttft: 5.85 sec

After:
300 tokens in 23.0sec, e2e:13.0 tps, prompt: 248.9 tps, gen: 16.6 tps, ttft: 4.99 sec

@qjia7 qjia7 changed the title [webgpu] Use components for VxAttentionScore [webgpu] Use components for VxAttentionScore and merge softmax to it Feb 18, 2025
@qjia7
Contributor Author

qjia7 commented Feb 18, 2025

@sushraja-msft @guschmue This PR applies flash attention's logic so that the softmax can be merged into the last-stage shader. It's used to optimize the generation shader. Since https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/webgpu/bert/flash_attention.cc#L413 hasn't been applied to GQA yet, this PR also benefits the prefill time.

As a next step, I can move/reuse CopyKVCache so that the generation shader can also combine QKT into one shader.
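The softmax-merge idea described above can be illustrated with a small Python sketch of the online-softmax rescaling trick that flash attention uses: instead of a separate softmax pass over the scores, a running max and running sum are maintained and the accumulated V-weighted output is rescaled as new scores arrive. This is illustrative pseudocode only, not the actual WGSL shader.

```python
import math

def streaming_softmax_weighted_sum(scores, values):
    """One-pass softmax fused with the weighted sum over V.

    Maintains a running max (for numerical stability), a running
    denominator, and a running weighted accumulator; previous state is
    rescaled whenever a new maximum is seen. Illustrative sketch only.
    """
    running_max = -math.inf
    running_sum = 0.0
    acc = 0.0
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        scale = math.exp(running_max - new_max)  # rescale prior state
        w = math.exp(s - new_max)
        running_sum = running_sum * scale + w
        acc = acc * scale + w * v
        running_max = new_max
    return acc / running_sum
```

The result matches a conventional two-pass softmax followed by a weighted sum, but needs only a single pass over the scores, which is what lets the softmax fold into the last-stage shader.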

@qjia7 qjia7 changed the title [webgpu] Use components for VxAttentionScore and merge softmax to it [webgpu] Use components for VxAttentionScore Feb 19, 2025
@qjia7
Contributor Author

qjia7 commented Feb 19, 2025

@sushraja-msft @guschmue, I have restored this PR to only add the components support, and will put the flash attention support in a separate PR. Please take a look, thanks.

@qjia7
Contributor Author

qjia7 commented Feb 20, 2025

@guschmue It seems that the 2 failing CIs are not related to my changes. Can this PR be merged?

@guschmue
Contributor

yeap

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Feb 20, 2025
@guschmue guschmue merged commit 9b2b2ee into main Feb 20, 2025
95 of 97 checks passed
@guschmue guschmue deleted the attention_opt branch February 20, 2025 17:08
guschmue pushed a commit that referenced this pull request Mar 6, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025