Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) #130250

Chillee · 2024-07-08T16:23:12Z

Stack from ghstack (oldest at bottom):

-> Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) #130250
Add block mask utility support for batches and heads > 1 #130227

After this PR, our numerical error is within 3% of FA2 for both the forward pass and the gradients. Previously, our numerical error for `dq` was 30% higher. I also added a `PRESCALE_QK` kernel option that improves perf by about 3-4% but incurs about 20-30% more numerical error.
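To make the `scale` / `PRESCALE_QK` distinction concrete, here is a minimal pure-Python sketch of softmax attention with an explicit `scale` argument. It is only an illustration and not the Triton kernel from this PR: the function `attention` and its `prescale_qk` flag are hypothetical names invented for this example. The flag multiplies `q` by `scale` before the dot product instead of scaling the scores afterwards. The two orderings are mathematically equivalent, so in float64 they agree to rounding error. In lower precision they can round differently, which is presumably where the accuracy trade-off described above comes from.

```python
import math


def attention(q, k, v, scale=None, prescale_qk=False):
    """Reference softmax attention over lists of row vectors.

    `scale` defaults to 1/sqrt(head_dim). `prescale_qk` is a hypothetical
    flag for this sketch: multiply q by `scale` before the dot product
    instead of scaling the scores afterwards.
    """
    head_dim = len(q[0])
    if scale is None:
        scale = 1.0 / math.sqrt(head_dim)
    out = []
    for q_row in q:
        if prescale_qk:
            # Scale the query once, then take plain dot products.
            q_scaled = [x * scale for x in q_row]
            scores = [sum(a * b for a, b in zip(q_scaled, k_row)) for k_row in k]
        else:
            # Take dot products first, then scale each score.
            scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        # Numerically stable softmax: subtract the row maximum before exp.
        row_max = max(scores)
        weights = [math.exp(s - row_max) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v)) / total
            for j in range(len(v[0]))
        ])
    return out


q = [[1.0, 2.0], [0.5, -1.0]]
k = [[0.3, 0.7], [1.5, -0.2], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

post = attention(q, k, v, scale=0.25)
pre = attention(q, k, v, scale=0.25, prescale_qk=True)
max_diff = max(abs(a - b) for r1, r2 in zip(post, pre) for a, b in zip(r1, r2))
```

Here `max_diff` measures how far the two orderings drift apart. Running the same comparison in a reduced-precision dtype is one way to see the kind of error gap quoted above.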

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

[ghstack-poisoned]

pytorch-bot · 2024-07-08T16:23:14Z


✅ No Failures

As of commit 1db7a9c with merge base 6875179 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 52fecb0 Pull Request resolved: #130250

vadimkantorov · 2024-07-08T17:22:35Z

@Chillee would FlexAttention be made available as a backend for SDPA? E.g. will this enable FAv2 impl with custom attn_bias? (at least for forward pass)

Is it polishing the triton impl of FAv2? I remember reading in issues that its perf was behind CUDA version of FAv2... And also people complained of slow backward...


drisspg · 2024-07-08T17:34:49Z

torch/_inductor/kernel/flex_attention.py

-            )
-            return subgraph_buffer
+
+            def convert_output_node_to_buffer(output):


nit: I find inlined functions really hard to read, especially when they are nested multiple levels deep.. probs why I struggle with PT2 lol

I like inlined functions particularly in cases like this, because it keeps the definition of the function close to the actual usage.


drisspg

left some comments, mostly nits


izaitsevfb · 2024-07-09T22:31:05Z

@pytorchbot revert -m "depends on #130227 which needs to be reverted" -c ghfirst

pytorchmergebot · 2024-07-09T22:32:45Z

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

…lexAttention numerics to be as accurate as FA2) (#130250)"

This reverts commit 3e48d92.

Reverted #130250 on behalf of https://github.com/izaitsevfb due to depends on #130227 which needs to be reverted.

pytorchmergebot · 2024-07-09T22:32:57Z

@Chillee your PR has been successfully reverted.


ghstack-source-id: 4c8cd50 Pull Request resolved: #130250

Chillee · 2024-07-10T02:42:35Z

@pytorchbot merge

pytorchmergebot · 2024-07-10T02:44:28Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).



pytorchmergebot · 2024-07-10T08:43:02Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

Chillee · 2024-07-10T16:12:28Z

@pytorchbot merge

pytorchmergebot · 2024-07-10T16:14:30Z

Merge started



…tion numerics to be as accurate as FA2) (pytorch#130250)

Pull Request resolved: pytorch#130250
Approved by: https://github.com/drisspg
ghstack dependencies: pytorch#130160, pytorch#130106, pytorch#130224, pytorch#130227


…tion numerics to be as accurate as FA2) (pytorch#130250)

After this PR, our numerical error is within 3% of FA2 for forward and gradients. Prior, for `dq` our numerical error was 30% higher. I also added a `PRESCALE_QK` kernel option that increases perf by about 3-4% but incurs about 20-30% more numerical error.

![image](https://github.com/pytorch/pytorch/assets/6355099/7b5ff44e-219b-4a05-8a1b-2a0182c01ab2)

Pull Request resolved: pytorch#130250
Approved by: https://github.com/drisspg
ghstack dependencies: pytorch#130227

Add scale kwarg to FlexAttention

56ad983

[ghstack-poisoned]

Chillee requested review from albanD, jbschlosser and mikaylagawarecki as code owners July 8, 2024 16:23

Chillee mentioned this pull request Jul 8, 2024

Fix indexing twice with score_mod #130224

Closed

Chillee mentioned this pull request Jul 8, 2024

Add block mask utility support for batches and heads > 1 #130227

Closed

pytorch-bot bot added ciflow/inductor module: dynamo module: inductor labels Jul 8, 2024

Chillee added a commit that referenced this pull request Jul 8, 2024

Add scale kwarg to FlexAttention

20210a7

ghstack-source-id: 52fecb0 Pull Request resolved: #130250

github-actions bot requested a review from ezyang July 8, 2024 16:23

Chillee requested review from drisspg and yanboliang July 8, 2024 16:24