
Conversation

@awgu (Collaborator) commented Sep 17, 2024

Stack from ghstack (oldest at bottom):

```
CUDA_VISIBLE_DEVICES=2,3,6,7 pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_train_parity_2d_transformer
```

cc @XilunWu @H-Huang @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

Differential Revision: D62964658

@pytorch-bot (bot) commented Sep 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136237

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit eaf893c with merge base c64ae60:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the oncall: distributed and release notes: distributed (fsdp) labels Sep 17, 2024
```
if isinstance(grad, AsyncCollectiveTensor):
    grad = grad.wait()
assert isinstance(grad, DTensor), f"{type(grad)}"
if any(pl.is_partial() for pl in grad.placements):
```
@awgu (Collaborator Author)

Previously, we changed any Partial placements to Replicate, mainly targeting the case where a replicated RMSNorm.weight had Partial gradients and we needed to trigger the all-reduce by converting from Partial to Replicate.

However, for the pos_embeddings in our toy Transformer class, the parameter has a Shard(0) placement while its gradient is still Partial. We do not want to convert the gradient to Replicate but rather to Shard(0).
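
As a minimal sketch of the intended behavior (not the exact FSDP2 code path): redistribute a Partial gradient to the parameter's own TP placements rather than forcing Replicate. The function name and the `tp_spec` argument are illustrative assumptions, and depending on the PyTorch version the import may live under torch.distributed._tensor instead of torch.distributed.tensor.

```
from torch.distributed.tensor import DTensor

def redistribute_partial_grad(grad: DTensor, tp_spec) -> DTensor:
    # Sketch only: `tp_spec` stands in for the parameter's TP sharding spec,
    # e.g. one whose placements contain Shard(0) for pos_embeddings.weight.
    if any(pl.is_partial() for pl in grad.placements):
        # Redistribute to the parameter's own placements: for a Shard(0)
        # parameter this reduce-scatters into a Shard(0) gradient instead of
        # all-reducing into a Replicate one.
        grad = grad.redistribute(placements=tuple(tp_spec.placements))
    return grad
```

For a replicated parameter such as RMSNorm.weight, the parameter's placements are Replicate anyway, so this still triggers the same all-reduce as before.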

@awgu (Collaborator Author)

I think that since the viewing into the reduce-scatter output uses torch.as_strided, we were silently handling a larger/replicated gradient for pos_embeddings.weight.
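
As a small self-contained illustration of why this can fail silently (not the FSDP2 code itself): torch.as_strided will happily carve the expected sharded view out of a buffer that is larger than expected, so no shape error surfaces.

```
import torch

# Illustration only: view a 4-element "shard" out of a gradient that is in
# fact full/replicated (8 elements); as_strided raises no error.
full_grad = torch.arange(8.0)
shard_view = full_grad.as_strided((4,), (1,), storage_offset=0)
print(shard_view)  # tensor([0., 1., 2., 3.])
```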

@awgu (Collaborator Author) commented Sep 18, 2024

cc: @mori360 do you know if I need to make any changes to CI files to have the added test run in CI? We prefer to run it with 4 GPUs.

@awgu added the release notes: distributed (fsdp2) label and removed the release notes: distributed (fsdp) label Sep 18, 2024
@awgu marked this pull request as ready for review September 18, 2024 00:01
@mori360 (Contributor) commented Sep 18, 2024

do you know if I need to make any changes to CI files to have the added test run in CI? We prefer to run it with 4 GPUs.

The 2d_composability file is tested under multigpu-test.sh, so we don't need to make any changes to the CI files.

```
-placements = [
-    Replicate() if pl.is_partial() else pl for pl in grad.placements
-]
+placements = self._tp_spec.placements
```
@awgu (Collaborator Author)

cc: @tianyu-l on this change just as a heads up

```
for ref_param, (param_name, param) in zip(
    ref_model.parameters(), model.named_parameters()
):
    full_grad = param.grad.full_tensor()
```
Contributor

Should we also assert that param.grad is sharded here? The test does not seem to differentiate Replicate vs. Shard.

@awgu (Collaborator Author)

Sounds good!

@awgu (Collaborator Author)

I added a check specifically for pos_embeddings.weight and its gradient's TP placement because that was the parameter that exercised the bug. It is hard to assert that param.grad is sharded in general since we would have to case on the parallelize plan (e.g., norm weights are not sharded).
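
For reference, a hedged sketch of what such a targeted check could look like in the test; the name matching and assertion wording are assumptions for illustration, not the committed test code (and the import may live under torch.distributed._tensor on older versions).

```
from torch.distributed.tensor import Shard

for param_name, param in model.named_parameters():
    if "pos_embeddings" in param_name:
        # The positional embedding is TP-sharded on dim 0, so its gradient
        # should keep a Shard(0) placement rather than being replicated.
        assert any(
            isinstance(pl, Shard) and pl.dim == 0
            for pl in param.grad.placements
        ), f"unexpected grad placements for {param_name}: {param.grad.placements}"
```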

```
CUDA_VISIBLE_DEVICES=2,3,6,7 pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_train_parity_2d_transformer
```


cc XilunWu H-Huang kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
@awgu (Collaborator Author) commented Sep 18, 2024

@awgu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorch-bot added the ciflow/trunk label Sep 19, 2024
@awgu (Collaborator Author) commented Sep 19, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
```
CUDA_VISIBLE_DEVICES=2,3,6,7 pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_train_parity_2d_transformer
```

Differential Revision: [D62964658](https://our.internmc.facebook.com/intern/diff/D62964658)
Pull Request resolved: pytorch#136237
Approved by: https://github.com/weifengpy
@github-actions deleted the gh/awgu/641/head branch October 20, 2024 02:09
