[CUDA] Update sm check for flash attention #24584
Merged
Conversation
hanbitmyths approved these changes on Apr 29, 2025
baijumeswani approved these changes on Apr 29, 2025
vraspar pushed a commit that referenced this pull request on May 1, 2025
### Description
Currently, flash attention is only enabled for sm8x and sm90, so Blackwell GPUs will not use flash attention. This change enables flash attention for sm > 90. Note that the flash attention implementation is not optimized for Blackwell, but it should be able to run on Blackwell GPUs.

Future work:
* Integrate flash attention for Hopper: https://github.com/Dao-AILab/flash-attention/tree/main/hopper
* Integrate FMHA for Blackwell: https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha
* Update cuDNN and the cuDNN frontend to the latest version (so that we can use the cuDNN flash attention for Blackwell).

### Motivation and Context
ORT GenAI is slow on RTX 5090.
jywu-msft pushed a commit that referenced this pull request on May 1, 2025
### Description
Cherry pick the following into [rel-1.22.0](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0):
- (#24491)
- (#24509)
- (#24564)
- (#24574)
- (#24582)
- (#24584)
- (#24568)
- (#24587)
- (#24563)
- (#24592)
- (#24526)
- (#24552)
- (#24588)
- (#24605)
- (#24606)

---------

Co-authored-by: Jing Fang <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
Co-authored-by: Baiju Meswani <[email protected]>
Co-authored-by: Scott McKay <[email protected]>
Co-authored-by: Mark Schofield <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Ashwath Shankarnarayan <[email protected]>
Co-authored-by: saurabh <[email protected]>
Co-authored-by: Adrian Lizarraga <[email protected]>
Co-authored-by: Hector Li <[email protected]>
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request on May 12, 2025
### Description
Currently, flash attention is only enabled for sm8x and sm90, so Blackwell GPUs will not use flash attention. This change enables flash attention for sm > 90. Note that the flash attention implementation is not optimized for Blackwell, but it should be able to run on Blackwell GPUs.

Future work:
* Integrate flash attention for Hopper: https://github.com/Dao-AILab/flash-attention/tree/main/hopper
* Integrate FMHA for Blackwell: https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha
* Update cuDNN and the cuDNN frontend to the latest version (so that we can use the cuDNN flash attention for Blackwell).

### Motivation and Context
ORT GenAI is slow on RTX 5090.
Contributor
This PR has been included in the 1.22.0 release.
Description
Currently, flash attention is only enabled for sm8x and sm90, so Blackwell GPUs will not use flash attention. This change enables flash attention for sm > 90.
Note that the flash attention implementation is not optimized for Blackwell, but it should be able to run on Blackwell GPUs.
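In effect, the gate reduces to a simple compute-capability comparison. The sketch below is illustrative only (the helper name and exact predicate are hypothetical, not the actual ONNX Runtime source); it shows how a relaxed check admits Blackwell alongside Ampere and Hopper:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not the real ONNX Runtime function): flash attention
// is allowed on sm 8x, sm 90, and now any newer architecture such as
// Blackwell (sm 100/120).
bool FlashAttentionSupported(int sm) {
  return sm >= 80;  // previously the check stopped at sm 90
}

int main() {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, /*device=*/0) != cudaSuccess) {
    std::printf("no CUDA device found\n");
    return 1;
  }
  int sm = prop.major * 10 + prop.minor;  // e.g. 8.6 -> 86, 12.0 -> 120
  std::printf("sm%d -> flash attention %s\n", sm,
              FlashAttentionSupported(sm) ? "enabled" : "disabled");
  return 0;
}
```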
Future work:
* Integrate flash attention for Hopper: https://github.com/Dao-AILab/flash-attention/tree/main/hopper
* Integrate FMHA for Blackwell: https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha
* Update cuDNN and the cuDNN frontend to the latest version (so that we can use the cuDNN flash attention for Blackwell).
Motivation and Context
ORT GenAI is slow on RTX 5090.