[CUDA] Update sm check for flash attention #24584

Merged
tianleiwu merged 1 commit into main from tlwu/enable_flash_attn_blackwell on Apr 29, 2025
Conversation


@tianleiwu tianleiwu commented Apr 29, 2025

Description

Currently, flash attention is only enabled for sm8x and sm90. That means Blackwell GPUs will not use flash attention. This change enables flash attention for sm > 90.

Note that the flash attention implementation is not optimized for Blackwell, but it should still be able to run on Blackwell GPUs.

Future work:

* Integrate flash attn for hopper: https://github.com/Dao-AILab/flash-attention/tree/main/hopper
* Integrate fmha for blackwell: https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha
* Update cudnn and cudnn frontend to the latest version (so that we can use the cudnn flash attention for blackwell).

Motivation and Context

ORT GenAI is slow on RTX 5090.

@tianleiwu tianleiwu marked this pull request as draft April 29, 2025 00:09
@tianleiwu tianleiwu marked this pull request as ready for review April 29, 2025 00:32
@tianleiwu tianleiwu merged commit 4adef01 into main Apr 29, 2025
82 of 88 checks passed
@tianleiwu tianleiwu deleted the tlwu/enable_flash_attn_blackwell branch April 29, 2025 03:32
vraspar pushed a commit that referenced this pull request May 1, 2025
### Description

Currently, flash attention is only enabled for sm8x and sm90. That means
Blackwell GPUs will not use flash attention. This change enables flash
attention for sm > 90.

Note that the flash attention implementation is not optimized for
Blackwell, but it should still be able to run on Blackwell GPUs.

Future works:
* Integrate flash attn for hopper:
https://github.com/Dao-AILab/flash-attention/tree/main/hopper
* Integrate fmha for blackwell:
https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha
* Update cudnn and cudnn frontend to latest version (so that we can use
the cudnn flash attention for blackwell).

### Motivation and Context
ORT GenAI is slow on RTX 5090.
jywu-msft pushed a commit that referenced this pull request May 1, 2025
### Description

Cherry pick the following into
[rel-1.22.0](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0)

- (#24491)
- (#24509)
- (#24564)
- (#24574)
- (#24582)
- (#24584)
- (#24568)
- (#24587)
- (#24563)
- (#24592)
- (#24526)
- (#24552)
- (#24588)
- (#24605)
- (#24606)

---------

Co-authored-by: Jing Fang <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
Co-authored-by: Baiju Meswani <[email protected]>
Co-authored-by: Scott McKay <[email protected]>
Co-authored-by: Mark Schofield <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Ashwath Shankarnarayan <[email protected]>
Co-authored-by: saurabh <[email protected]>
Co-authored-by: Adrian Lizarraga <[email protected]>
Co-authored-by: Hector Li <[email protected]>
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request May 12, 2025

snnn commented Sep 5, 2025

This PR has been included in the rel-1.22.0 branch. Removing the release:1.22.0 label.

