Elide calls to is_nested in Dynamo-traced graphs #138841

jbschlosser · 2024-10-24T19:00:51Z

Stack from ghstack (oldest at bottom):

Before this PR, calling is_nested in-graph would result in graph code like the following:

  class GraphModule(torch.nn.Module):
      def forward(self, L_nt_: "f64[3, s1, 5]", s1: "Sym(s1)"):
          l_nt_ = L_nt_

          # Note this useless line!
          getattr_1 = l_nt_.is_nested;  getattr_1 = None

          add: "f64[3, s1, 5]" = l_nt_ + 2;  l_nt_ = None
          return (add,)

This PR follows what is done for is_sparse / is_quantized: store the value on TensorVariable and have getattr calls to is_nested return the stored value as a constant. This removes the useless line above from the graph. Guarding is already handled through tensor type check guards, so there is no need to guard on is_nested status separately.
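
To make the idea concrete, here is a minimal, library-free sketch of the pattern described above. It is not Dynamo's actual code: `ToyGraph`, `ToyTensorVariable`, `CONSTANT_PROPS`, and the shape of `var_getattr` are all hypothetical names chosen for illustration. The only point is that a property captured when the tensor is wrapped can be answered as a constant, so no getattr line reaches the traced graph.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGraph:
    """Records the lines a toy tracer would emit (hypothetical helper)."""
    lines: list = field(default_factory=list)

    def emit(self, line):
        self.lines.append(line)


class ToyTensorVariable:
    """Hypothetical stand-in for a tracer-side tensor wrapper (not Dynamo's class)."""

    # Properties captured once at wrap time and served as constants afterwards.
    CONSTANT_PROPS = ("is_sparse", "is_quantized", "is_nested")

    def __init__(self, name, graph, **props):
        self.name = name
        self.graph = graph
        self.props = {p: bool(props.get(p, False)) for p in self.CONSTANT_PROPS}

    def var_getattr(self, attr):
        if attr in self.props:
            # Constant path: the stored value is returned, nothing is emitted.
            return self.props[attr]
        # Fallback path: record a getattr node in the graph.
        self.graph.emit(f"getattr_1 = {self.name}.{attr}")
        return f"{self.name}.{attr}"


graph = ToyGraph()
nt = ToyTensorVariable("l_nt_", graph, is_nested=True)
nested = nt.var_getattr("is_nested")    # constant, no graph line
nt.var_getattr("some_other_attr")       # falls back to a graph node
```

In this toy model, only the second lookup leaves a line in `graph.lines`; the `is_nested` lookup is resolved entirely at trace time.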

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @rec

[ghstack-poisoned]

pytorch-bot · 2024-10-24T19:00:54Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138841

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit db920c8 with merge base 239a21f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


jbschlosser · 2024-10-25T16:49:55Z

@pytorchbot merge

pytorchmergebot · 2024-10-25T16:51:37Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-10-25T22:50:18Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

Skylion007 · 2024-10-26T15:01:35Z

@pytorchbot merge

pytorchmergebot · 2024-10-26T15:03:09Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

This PR adds FlexAttention + NJT support. In particular:
* To handle raggedness, treats the packed sequence dim of input NJTs as a giant "stacked sequence". To ensure user `score_mod` / `mask_mod` functions can still be written in the original NJT sequence space, this PR handles conversions for indices within the giant "stacked sequence" -> sequence relative indices automatically.
* Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately
* Adds barebones `new_empty()` support to NJT since FlexAttention utilizes this repeatedly; right now, only `new_empty()` with a shape of `()` is supported
* Tests that FlexAttention with a causal mask matches causal SDPA
* Adds a new public API for FlexAttention usage:
    * `create_nested_block_mask(mask_mod, B, H, njt, BLOCK_SIZE, _compile)` - NJT analogue for `create_block_mask()` that utilizes the `njt`'s ragged structure to create an appropriately-sized block mask (e.g. `(1, 1, total_seqlen, total_seqlen)`). This function handles the index conversion from "stacked sequence" space -> relative sequence space.
        * Minor note: as this is a public API, this function is purposefully named with "nested" instead of "njt" to keep the latter as an informal, mostly internal-only term.

Example usage:
```python
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

query = ...  # NJT of shape (B, H, S*, D)
key = ...  # NJT of shape (B, H, S*, D)
value = ...  # NJT of shape (B, H, S*, D)

# create_nested_block_mask() automatically converts indices from "stacked sequence" space -> relative sequence space
block_mask = create_nested_block_mask(causal_mask, 1, 1, query)  # block mask conceptual shape is (B, H, sum(S*), sum(S*))
output = flex_attention(query, key, value, block_mask=block_mask)

def causal_score_mod(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

# flex_attention() automatically converts indices from "stacked sequence" space -> relative sequence space for NJT inputs
output2 = flex_attention(query, key, value, score_mod=causal_score_mod)
```

TODO:
* ~~Determine the right level of abstraction for public API helpers + move them alongside other helpers~~ Verify this with others though
* ~~Some cleanup~~
* ~~`njt_score_mod_adapter`~~
* ~~Q: should `create_njt_block_mask()` call `njt_mask_mod_adapter()` so we don't need two calls?~~
* Can we avoid materializing the `sum(s)` length `seq_idx` used for conversion between stacked sequence -> sequence relative indices?
    * Not for now, although future work may deepen the integration between Flex + NJT (possibly requiring custom templates). We should try to cache this though.
* ~~Demonstrate non-causal mask~~
* Support non-contiguous NJTs with holes (**booted to future PR**)

Pull Request resolved: #136792
Approved by: https://github.com/drisspg
ghstack dependencies: #138841
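
The "stacked sequence" -> relative index conversion mentioned in the commit message can be illustrated with a small pure-Python sketch. This is not the PyTorch implementation: `build_index_maps` and `adapt_mask_mod` are hypothetical helpers, the sequence lengths `[2, 3]` are made up, and two behaviours are assumptions made for the illustration (pairs from different sequences are masked out, and `b` is passed through unchanged). The `seq_idx` list plays the role of the `sum(s)`-length lookup discussed in the TODO list.

```python
from itertools import accumulate


def build_index_maps(seq_lengths):
    """Return (offsets, seq_idx) for sequences packed back to back."""
    # offsets[i] is the stacked position where sequence i starts.
    offsets = [0] + list(accumulate(seq_lengths))
    # seq_idx[p] is the sequence that owns stacked position p (length sum(seq_lengths)).
    seq_idx = [i for i, n in enumerate(seq_lengths) for _ in range(n)]
    return offsets, seq_idx


def adapt_mask_mod(mask_mod, seq_lengths):
    """Wrap a mask_mod written in per-sequence space so it accepts stacked indices."""
    offsets, seq_idx = build_index_maps(seq_lengths)

    def stacked_mask_mod(b, h, q_idx, kv_idx):
        q_seq, kv_seq = seq_idx[q_idx], seq_idx[kv_idx]
        if q_seq != kv_seq:
            # Assumption: positions in different sequences never attend to each other.
            return False
        # Convert stacked indices to indices relative to the owning sequence.
        q_rel = q_idx - offsets[q_seq]
        kv_rel = kv_idx - offsets[kv_seq]
        return mask_mod(b, h, q_rel, kv_rel)

    return stacked_mask_mod


def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx


# Two sequences of lengths 2 and 3 packed into a stacked sequence of length 5.
stacked = adapt_mask_mod(causal_mask, [2, 3])
mask = [[stacked(0, 0, q, kv) for kv in range(5)] for q in range(5)]
```

The resulting `mask` is block-diagonal, with a lower-triangular (causal) block per sequence, which matches the conceptual `(sum(S*), sum(S*))` shape described above.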


Elide calls to is_nested in Dynamo-traced graphs

877a6ca

[ghstack-poisoned]

pytorch-bot bot added ciflow/inductor module: dynamo labels Oct 24, 2024

This was referenced Oct 24, 2024

FlexAttention support for NJT #136792

Closed

Propagate NJT lengths through op calls #138098

Closed

jbschlosser requested a review from soulitzer October 24, 2024 19:01

jbschlosser added the topic: not user facing topic category label Oct 24, 2024

soulitzer approved these changes Oct 24, 2024

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 25, 2024

pytorchmergebot added the merging label Oct 25, 2024

pytorchmergebot added the Merged label Oct 26, 2024

pytorchmergebot closed this in 14a17ad Oct 26, 2024

pytorchmergebot removed the merging label Oct 26, 2024

github-actions bot deleted the gh/jbschlosser/194/head branch November 26, 2024 02:09
