
Conversation

@drisspg
Contributor

@drisspg drisspg commented Oct 7, 2024

Stack from ghstack (oldest at bottom):

Summary

The follow-up PR to #137526. In this PR we update the lowerings for the flex_attention backwards kernel to generate fused backward gradient calculations for any captured buffers that require grads.

We do this with tl.atomic_add, scattering the correct gradients into a zeroed-out buffer for each captured buffer that requires grads. Added many test cases, and along the way found and fixed some masking bugs.

There are likely some performance cliffs here, specifically with certain dtypes and on different GPUs. We plan to profile the current strategy in a follow-up. We are explicitly choosing reduced memory over increased performance right now.

By using atomics, we do not need to realize a full attention scores matrix. However, this comes with two downsides: one, it is potentially slower in some cases, and two, the gradient calculation for any captured buffers is non-deterministic.
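To make the memory tradeoff concrete, here is a rough back-of-the-envelope sketch; the shapes are hypothetical and chosen only for illustration, not taken from this PR:

```Python
# Illustrative only: compare materializing the full scores gradient
# (what atomics let us avoid) against the captured-buffer grad we keep.
B, H, Q_LEN, KV_LEN, SEQ = 8, 16, 4096, 4096, 4096  # made-up shapes
BYTES_FP32 = 4

scores_grad_bytes = B * H * Q_LEN * KV_LEN * BYTES_FP32  # full (B, H, Q, KV) buffer
bias_grad_bytes = SEQ * BYTES_FP32                        # one fp32 slot per bias element

print(f"full scores grad: {scores_grad_bytes / 2**30:.1f} GiB")  # 8.0 GiB
print(f"bias grad buffer: {bias_grad_bytes / 2**10:.1f} KiB")    # 16.0 KiB
```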

Worked Example

Let's do the case where you read from one bias that doesn't require grad and use it to index into another bias that does.

ScoreMod:

bias = torch.randn(
    params.seq_length,
    device=self.device,
    dtype=params.dtype,
    requires_grad=True,
)

offset = torch.randint(
    0,
    params.seq_length,
    (params.seq_length,),
    device=self.device,
)

def score_mod(score, b, h, q_idx, kv_idx):
    return score + bias[offset[q_idx]]
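
For context, a minimal sketch of how this score_mod might be driven end to end; the shapes and the torch.compile invocation here are assumptions for illustration, not code from this PR:

```Python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Hypothetical shapes; any (B, H, SEQ, D) layout works the same way.
B, H, SEQ, D = 2, 4, 256, 64
device, dtype = "cuda", torch.float16

q, k, v = (
    torch.randn(B, H, SEQ, D, device=device, dtype=dtype, requires_grad=True)
    for _ in range(3)
)
bias = torch.randn(SEQ, device=device, dtype=dtype, requires_grad=True)
offset = torch.randint(0, SEQ, (SEQ,), device=device)

def score_mod(score, b, h, q_idx, kv_idx):
    # Index through a non-differentiable offset buffer into a learnable bias.
    return score + bias[offset[q_idx]]

flex = torch.compile(flex_attention)
out = flex(q, k, v, score_mod=score_mod)
out.sum().backward()
# With this PR, bias.grad is produced by the fused backward kernel.
print(bias.grad.shape)  # torch.Size([256])
```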
    

I am showing only the new subgraph injected into the backward kernel; everything else has been removed:

    dsT = pT * (dpT - Di[None, :])
    # ~~~~~~~~~~~~~~~~~~~ Apply joint modification  ~~~~~~~~~~~~~~~~~~~
    grad_scores = (dsT)


    # ~~~~~~~~~~~~~~~~~~~ Apply other buffer grad writes ~~~~~~~~~~~~~
    idx_b = off_z
    idx_h = off_hq
    idx_m = m
    idx_n = n
    scatter_mask = offs_m1[None, :] < Q_LEN and offs_n1[:, None] < KV_LEN
    tmp4 = (dsT).to(tl.float32)
    tl.atomic_add(out_ptr1 + (tl.broadcast_to(tl.load(in_ptr16 + idx_m), tmp4.shape)), tmp4, scatter_mask, sem='relaxed')

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
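
For intuition, the scatter above computes (for this worked example) the same thing as the following eager reference. Here grad_scores stands for the full (B, H, Q_LEN, KV_LEN) gradient of the modified scores, which the real kernel never materializes; this is a hand-written sketch, not kernel code:

```Python
import torch

def reference_bias_grad(grad_scores, offset, bias_len):
    # score[b, h, q, kv] reads bias[offset[q]], so its gradient contributes
    # to bias.grad[offset[q]]. Reduce over batch, head and kv, then
    # scatter-add along the q dimension (the tl.atomic_add in the kernel).
    per_q = grad_scores.sum(dim=(0, 1, 3))  # (Q_LEN,)
    grad_bias = torch.zeros(
        bias_len, dtype=torch.float32, device=grad_scores.device
    )  # fp32 accumulation buffer, matching the kernel's choice
    grad_bias.index_add_(0, offset, per_q.to(torch.float32))
    return grad_bias
```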

Key points

  • We always accumulate into float32 grad buffers regardless of the forward dtype. We normally do all intra-kernel computation with fp32 accumulation, and we want the same behavior for the atomic additions (see the sketch after this list).
  • We are currently restricted to one scatter per kernel. I have some ideas on fx rewrites that would remove this restriction, but for now there is a clear error message with a workaround; lifting the restriction is left as a follow-up.
  • Will do more extensive performance/memory profiling in a follow-up.
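
A tiny illustration (not from the PR) of why the accumulation buffer stays fp32: repeatedly adding small contributions in a low-precision dtype stalls the sum, which is exactly the situation when many atomic adds land on one bias element:

```Python
import torch

# Many small gradient contributions landing on a single buffer element.
vals = torch.full((10_000,), 1e-3)

fp32_acc = vals.sum()                        # fp32 accumulation
bf16_acc = torch.zeros((), dtype=torch.bfloat16)
for v in vals.to(torch.bfloat16):            # naive low-precision accumulation
    bf16_acc = bf16_acc + v

print(fp32_acc.item())  # ~10.0
print(bf16_acc.item())  # much smaller: updates vanish once 1e-3 drops
                        # below half a bf16 ulp of the running sum
```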

Toy E2E example

I have a toy E2E training example PR in the gym for now: meta-pytorch/attention-gym#84
I plan to update it to a realistic learnable bias before landing.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @Chillee @yanboliang @BoyuanFeng

@pytorch-bot

pytorch-bot bot commented Oct 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137452

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5508c30 with merge base 259a00b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@drisspg drisspg marked this pull request as draft October 7, 2024 23:10
@drisspg drisspg removed the request for review from zou3519 October 7, 2024 23:39
drisspg added a commit that referenced this pull request Oct 8, 2024
drisspg added a commit that referenced this pull request Oct 8, 2024
drisspg added a commit that referenced this pull request Oct 17, 2024
drisspg added a commit that referenced this pull request Oct 29, 2024
drisspg added a commit that referenced this pull request Nov 7, 2024
@Chillee
Collaborator

Chillee commented Nov 23, 2024

!!

@drisspg drisspg added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 24, 2024
@drisspg drisspg added rocm This tag is for PRs from ROCm team ciflow/rocm Trigger "default" config CI on ROCm and removed rocm This tag is for PRs from ROCm team labels Nov 24, 2024
drisspg added a commit that referenced this pull request Nov 25, 2024
drisspg added a commit that referenced this pull request Nov 25, 2024
@drisspg
Contributor Author

drisspg commented Nov 25, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Pull Request resolved: pytorch#137452
Approved by: https://github.com/Chillee
@github-actions github-actions bot deleted the gh/drisspg/61/head branch December 26, 2024 02:04

Labels

ciflow/inductor · ciflow/rocm · ciflow/trunk · Merged · module: cpu · module: flex attention · module: inductor · topic: not user facing
