
Conversation

@fegin fegin commented Sep 9, 2025

Stack from ghstack (oldest at bottom):

Motivation

Since FlexAttention and SDPA are functions, not modules, we have tried numerous mechanisms to dispatch them to customized call paths so that we can inject the CP logic. Unfortunately, each of these approaches has its own composability issues with other techniques.

Candidate Approaches

  1. Ask users to write a module to wrap FlexAttention/SDPA and use `parallelize_module` to install a forward hook.

    • Pros: This is similar to how we do TP.
    • Cons: 1) It is cumbersome for users, as they need to create a new module. 2) CP must be applied in two places, since a context_parallel context manager is still required for splitting the inputs.
  2. Provide a function wrapper.

    • Pros: Users just need to replace their FlexAttention/SDPA calls with the wrapper.
    • Cons: It is not the same API, though we can keep the wrapper's signature identical to the core API.

Summary

~~This PR implements approach 2 and refactors the code so that most of it can be reused by approach 1, which will be introduced in another PR.~~

We changed this PR to implement approach 1, as people preferred it for its consistency with the existing parallelisms. This PR can also serve as the foundation for implementing approach 2, which was the early version of this PR.

This PR also changes the `create_cp_block_mask` logic: since we now focus only on the module-wrapper approach, we no longer need to hack the `seq_len` field of a `BlockMask`.

This PR also removes the TorchFunctionMode-based dispatcher, as it doesn't compose well with SAC (selective activation checkpointing).
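
A minimal sketch of the module-wrapper flow described above. The `SDPAWrapper` module and the `ContextParallel` plan name are illustrative assumptions, not necessarily the exact API introduced by this PR:

```python
# Sketch only: SDPAWrapper and ContextParallel are illustrative names,
# not necessarily the exact API added by this PR.
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed.tensor.parallel import parallelize_module


class SDPAWrapper(nn.Module):
    """Wraps the functional SDPA call so a forward hook can be installed on it."""

    def forward(self, q, k, v, **kwargs):
        return F.scaled_dot_product_attention(q, k, v, **kwargs)


# Hypothetical usage: apply a CP plan to the wrapper module, mirroring how
# TP plans are applied with parallelize_module.
# parallelize_module(model.attn_wrapper, cp_mesh, ContextParallel(...))
```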

cc @H-Huang @awgu @wanchaol @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci


pytorch-bot bot commented Sep 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162542

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a18e4eb with merge base d41aa18:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Sep 9, 2025
fegin added a commit that referenced this pull request Sep 9, 2025
**Motivation**

Since FlexAttention and SDPA are functions, not modules, we have tried numerous mechanisms to dispatch them to customized call paths so that we can inject the CP logic. Unfortunately, each of these approaches has its own composability issues with other techniques.

**Candidate Approaches**

1. Ask users to write a module to wrap FlexAttention/SDPA and use `parallelize_module` to install a forward hook.

   - Pros: This is similar to how we do TP.
   - Cons: 1) It is cumbersome for users, as they need to create a new module. 2) CP must be applied in two places, since a context_parallel context manager is still required for splitting the inputs.

2. Provide a function wrapper.

   - Pros: Users just need to replace their FlexAttention/SDPA calls with the wrapper.
   - Cons: It is not the same API, though we can keep the wrapper's signature identical to the core API.

**Summary**

This PR implements approach 2, but uses an nn.Module to mimic a function wrapper so that we can record CP state on the wrapper module instead of leaking it into the `_attention` module (see the sketch below).


ghstack-source-id: 0142083
Pull-Request-resolved: #162542
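
A minimal sketch of that idea, under the assumption that the "function wrapper" is simply a callable nn.Module instance holding CP state; the names here are illustrative, not the PR's actual code:

```python
# Illustrative sketch: a module instance used like the SDPA function, so that
# CP state can live on the instance rather than in module-level globals.
import torch.nn as nn
import torch.nn.functional as F


class _SDPAFunctionWrapper(nn.Module):
    def __init__(self):
        super().__init__()
        self._cp_enabled = False  # hypothetical per-wrapper CP state

    def forward(self, q, k, v, **kwargs):
        # With CP enabled, the installed hook/dispatch logic would reroute this
        # call through the CP path; otherwise it falls through to plain SDPA.
        return F.scaled_dot_product_attention(q, k, v, **kwargs)


# Call sites replace F.scaled_dot_product_attention(q, k, v) with:
# sdpa = _SDPAFunctionWrapper(); out = sdpa(q, k, v)
```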
@fegin fegin changed the title [CP] Introduce flex_attention_wrapper [CP][WIP] Introduce flex_attention_wrapper Sep 9, 2025
fegin added a commit that referenced this pull request Sep 10, 2025
fegin added a commit that referenced this pull request Sep 10, 2025
fegin added a commit to pytorch/torchtitan that referenced this pull request Sep 10, 2025
fegin added a commit that referenced this pull request Sep 10, 2025
fegin added a commit to pytorch/torchtitan that referenced this pull request Sep 12, 2025
Similar to #1696, but this PR uses `parallelize_module`, similar to TP/SP.

This PR also requires pytorch/pytorch#162542
fegin added a commit that referenced this pull request Sep 12, 2025
@fegin fegin changed the title [CP][WIP] Introduce flex_attention_wrapper [CP][WIP] Introduce flex_attention_wrapper and ContextParallel parallel plan Sep 12, 2025
fegin added a commit that referenced this pull request Sep 15, 2025
fegin added a commit that referenced this pull request Sep 16, 2025
fegin added a commit that referenced this pull request Sep 16, 2025

fegin commented Oct 9, 2025

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.


fegin commented Oct 10, 2025

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

@pytorchmergebot

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: Raised by workflow job

Failing merge rule: Core Maintainers

fegin added 2 commits October 9, 2025 23:27

fegin commented Oct 10, 2025

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Oct 13, 2025

`context_parallel()` being a context manager has annoyed users. Now that we plan to redesign CP's UX to explicitly ask users to:

1. Wrap the attention op into an `nn.Module`
2. Lift any buffers that are not sequence-agnostic to inputs (see the sketch after this commit message)

We can replace `context_parallel()` with two functional APIs: `_context_parallel_shard` and `_enable_context_parallel_dispatcher`.

Pull Request resolved: #164500
Approved by: https://github.com/XilunWu
ghstack dependencies: #162542
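
A minimal sketch of what "lift any buffers that are not sequence-agnostic to inputs" means in practice; the module and argument names are illustrative assumptions, not the actual torchtitan/PyTorch code:

```python
# Illustrative only: instead of registering a sequence-length-dependent buffer
# (e.g. an attention mask) on the module, pass it in as a forward argument so
# the CP sharding step can split it alongside q/k/v.
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    def forward(self, q, k, v, attn_mask=None):
        # attn_mask is "lifted" to an input rather than stored as self.mask.
        return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```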
    # TODO: Reverify atol and rtol after
    # https://github.com/pytorch/pytorch/pull/163185 is landed. The accuracy
    # issue happens on the gradients.
    torch.use_deterministic_algorithms(True)

Please don't do things like this. It might subtly break other tests, since it globally changes the deterministic setting. Please enable deterministic mode only in the tests that need it, and use a context manager so that it resets back to the prior state.
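
A minimal sketch of the suggested pattern, using torch's public query/set APIs; the `deterministic_algorithms` helper name here is my own, not an existing PyTorch utility:

```python
# Hypothetical helper illustrating the reviewer's suggestion: scope the
# deterministic setting and restore the previous state on exit.
import contextlib
import torch


@contextlib.contextmanager
def deterministic_algorithms(enabled: bool = True):
    prev = torch.are_deterministic_algorithms_enabled()
    torch.use_deterministic_algorithms(enabled)
    try:
        yield
    finally:
        torch.use_deterministic_algorithms(prev)


# Usage inside a single test:
# with deterministic_algorithms(True):
#     run_the_numerics_check()
```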

pytorchmergebot pushed a commit that referenced this pull request Oct 13, 2025
The custom op fetches the required K and V. Currently, the forward pass is just an all-gather and the backward pass is a reduce-scatter (see the sketch after this commit message). While the logic is the same as `all_gather_tensor_autograd`, the custom op avoids the autograd warning that `wait_tensor()` is registered to autograd.

For the next step, we should explore how to infer the required communication from the information in the BlockMask.

Pull Request resolved: #163185
Approved by: https://github.com/XilunWu
ghstack dependencies: #162542, #164500
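
A minimal sketch, under my own assumptions (not the PR's actual custom op), of an autograd-aware K/V fetch whose forward is an all-gather and whose backward is a reduce-scatter along the sharded sequence dimension (dim 0 here):

```python
# Sketch only: assumes a 1D process group and sharding along dim 0.
import torch
import torch.distributed as dist


class _AllGatherKV(torch.autograd.Function):
    @staticmethod
    def forward(ctx, local_kv: torch.Tensor, group: dist.ProcessGroup) -> torch.Tensor:
        ctx.group = group
        world_size = dist.get_world_size(group)
        out = local_kv.new_empty((world_size * local_kv.shape[0], *local_kv.shape[1:]))
        # Forward: every rank gathers the full K/V it needs for attention.
        dist.all_gather_into_tensor(out, local_kv.contiguous(), group=group)
        return out

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor):
        group = ctx.group
        world_size = dist.get_world_size(group)
        local_grad = grad_out.new_empty((grad_out.shape[0] // world_size, *grad_out.shape[1:]))
        # Backward: sum-reduce the gathered gradient back to each rank's shard.
        dist.reduce_scatter_tensor(local_grad, grad_out.contiguous(), op=dist.ReduceOp.SUM, group=group)
        return local_grad, None
```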
pytorchmergebot pushed a commit that referenced this pull request Oct 14, 2025

No logic change; just polishes the docstrings and comments and removes unused variables.

Pull Request resolved: #165039
Approved by: https://github.com/XilunWu
ghstack dependencies: #162542, #164500, #163185
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 15, 2025
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
@github-actions github-actions bot deleted the gh/fegin/318/head branch November 13, 2025 02:17

Labels

ciflow/inductor, ciflow/trunk, Merged, module: context parallel, oncall: distributed, release notes: context parallel
