Generalization of FSDP common for non-cuda execution #133209

ankurneog · 2024-08-12T09:18:28Z

Motivation

The common FSDP code used for FSDP unit test (UT) execution is mostly written with CUDA devices in mind. However, other devices, such as Intel Gaudi, support most of the functionality. We are generalizing the base content so that the UT content can also be used for non-CUDA device execution.

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot · 2024-08-12T09:18:31Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133209

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 55cdd62 with merge base 76b044d:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / linux-focal-py3.12-clang10 / test (dynamo, 3, 3, lf.linux.2xlarge) (gh) (disabled by #134600, #132861, #132862 but the issue was closed recently and a rebase is needed to make it pass)
test_transformers.py::TestSDPAPrivateUse1Only::test_fused_sdp_choice_privateuseone
pull / linux-jammy-py3.10-clang15-asan / test (default, 2, 6, lf.linux.4xlarge) (gh) (disabled by #134600, #132861, #132862 but the issue was closed recently and a rebase is needed to make it pass)
test_transformers.py::TestSDPAPrivateUse1Only::test_fused_sdp_choice_privateuseone

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ankurneog · 2024-08-14T00:45:41Z

@wconstab, @albanD: can you please help with the review? Thanks.

albanD · 2024-08-15T14:16:43Z

@wconstab who would have some bandwidth to handle these FSDP updates that make it work with non-CUDA devices?

ankurneog · 2024-08-20T08:43:14Z

@wconstab: Can you please help with the review and approval? The main changes are here: https://github.com/pytorch/pytorch/pull/133209/files#diff-7b5c66f99161fa6a3d9042e80f8c8cc140a64e43445feede46f55e53154f6c3d (torch/testing/_internal/common_fsdp.py); the others are mostly adaptations to the changes made in that file.

ankurneog · 2024-09-06T04:34:20Z

@albanD / @wconstab: could you please help with the review and approval? Note that we have verified this on both Intel Gaudi and CUDA. The corresponding test module changes are included in #135242.
Thanks.

wconstab · 2024-09-06T23:32:44Z

torch/testing/_internal/common_fsdp.py

just curious, why the difference?

+1, please help us unify code paths to reduce maintenance load.

@ankurneog this

wconstab

mostly LGTM. a couple of small questions and then i would stamp.

Also, fwiw you may want to look into FSDP2 support as well, if you haven't.

wconstab · 2024-09-06T23:34:29Z

torch/testing/_internal/common_fsdp.py

~~todo- is this actually unchanged in the new formula?~~

as long as TEST_CUDA := torch.cuda.is_available(), then DEVICE_COUNT := torch.cuda.device_count() so this appears equivalent.
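The equivalence described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `resolve_test_device` and the fallback order (CUDA, then HPU, then CPU) are hypothetical and not taken from common_fsdp.py. Availability flags and counts are passed in as plain values so the sketch runs without any accelerator.

```python
# Hypothetical sketch of a generalized device selection; the function name and
# the CUDA -> HPU -> CPU fallback order are assumptions, not the PR's code.
from typing import Tuple


def resolve_test_device(
    cuda_available: bool,
    cuda_device_count: int,
    hpu_available: bool,
    hpu_device_count: int,
) -> Tuple[str, int]:
    """Return (device_type, device_count) for the test run."""
    if cuda_available:
        # On a CUDA machine the generalized count is just the CUDA count,
        # so a check written against DEVICE_COUNT matches the old
        # torch.cuda.device_count() check.
        return "cuda", cuda_device_count
    if hpu_available:
        return "hpu", hpu_device_count
    # No accelerator: report zero devices so multi-device tests can be skipped.
    return "cpu", 0


def has_enough_devices(device_count: int, needed: int) -> bool:
    """Generalized replacement for a 'skip if fewer than N GPUs' condition."""
    return device_count >= needed
```

With these assumptions, a CUDA host with four GPUs yields `("cuda", 4)`, so a condition such as `has_enough_devices(device_count, 2)` gives the same answer as the CUDA-only formulation.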

wconstab · 2024-09-06T23:36:35Z

torch/testing/_internal/common_fsdp.py

is this a behavior change? were we relying on the 'set_device' for cuda to cover this before and now we make it explicit?

@ankurneog this

ankurneog · 2024-09-09T05:03:45Z

@wconstab: thank you for your review comments. I have addressed them and also replied to your queries. Could you please have a look and help with the approval? Thank you.

ankurneog · 2024-09-12T02:42:46Z

@wconstab : gentle reminder , could you kindly do the needful, thanks.

ankurneog · 2024-09-17T02:53:00Z

@wconstab / @fegin: could you please help with the approval? We have another PR, #135242, that depends on this one. Thank you.

kwen2501 · 2024-09-17T06:29:10Z

Nice effort.
Curious -- does anyone know the status of FSDP's device support now? @awgu
Or is the plan to first improve the tests then the library? @ankurneog

ankurneog · 2024-09-18T05:24:01Z

@pytorchbot rebase

pytorchmergebot · 2024-09-18T05:25:34Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-09-18T05:25:37Z

Successfully rebased fsdp_common_changes onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fsdp_common_changes && git pull --rebase)

pytorchmergebot · 2024-09-24T23:25:33Z

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

ankurneog · 2024-09-24T23:52:41Z

@albanD , @kwen2501 : Could you please help with the merge . Thank you

ankurneog · 2024-09-25T07:30:26Z

@pytorchbot rebase

pytorchmergebot · 2024-09-25T07:32:02Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-09-25T07:32:05Z

Successfully rebased fsdp_common_changes onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fsdp_common_changes && git pull --rebase)

ankurneog · 2024-09-26T10:43:47Z

@pytorchbot rebase

pytorchmergebot · 2024-09-26T10:45:21Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-09-26T10:45:25Z

Successfully rebased fsdp_common_changes onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fsdp_common_changes && git pull --rebase)

ankurneog · 2024-09-26T14:48:30Z

@albanD , @kwen2501 : Could you please help with the merge . The failures look unrelated.

ankurneog · 2024-09-27T00:31:10Z

@pytorchbot merge

pytorchmergebot · 2024-09-27T00:32:52Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

awgu · 2024-09-27T00:44:09Z

torch/testing/_internal/common_fsdp.py

+            if TEST_HPU:
+                time.sleep(self.delay_before_reduction_ms / 1000)
+            elif TEST_CUDA:
+                torch.cuda._sleep(int(self.delay_after_loss_ms * get_cycles_per_ms()))


why is HPU sleeping based on delay_before_reduction_ms instead of delay_after_loss_ms?
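To make the question concrete, here is a minimal sketch in which both branches honor the same `delay_after_loss_ms` value. It is not the PR's code: the function name `sleep_after_loss` and its injectable `host_sleep` and `cuda_spin` callables are hypothetical, introduced so the logic can run without an accelerator. In real use, `cuda_spin` would presumably be `torch.cuda._sleep` and `cycles_per_ms` would come from `get_cycles_per_ms()`, as in the quoted diff.

```python
# Hypothetical sketch: one delay parameter drives both the CUDA and the
# non-CUDA branch. Names other than delay_after_loss_ms are illustrative.
import time
from typing import Callable, Optional


def sleep_after_loss(
    device_type: str,
    delay_after_loss_ms: float,
    host_sleep: Callable[[float], None] = time.sleep,
    cuda_spin: Optional[Callable[[int], None]] = None,
    cycles_per_ms: float = 0.0,
) -> None:
    """Delay after the loss computation using the same knob on every device."""
    if delay_after_loss_ms <= 0:
        return
    if device_type == "cuda":
        if cuda_spin is None:
            raise ValueError("cuda_spin is required for the CUDA branch")
        # Device-side spin, expressed in cycles (mirrors the quoted CUDA branch).
        cuda_spin(int(delay_after_loss_ms * cycles_per_ms))
    else:
        # Host-side sleep in seconds, derived from the same delay_after_loss_ms
        # rather than delay_before_reduction_ms.
        host_sleep(delay_after_loss_ms / 1000)
```

Note that a host-side `time.sleep` blocks the Python thread, while the CUDA spin is issued to the device, so the two branches are not equivalent in how they delay work even when they read the same parameter.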

this one I just added is new

awgu · 2024-09-27T00:45:23Z

I am kind of confused why some of the comments left by previous reviewers were left unaddressed or without response.

ankurneog · 2024-09-27T03:08:51Z

I am kind of confused why some of the comments left by previous reviewers were left unaddressed or without response.

Could you please let me know what needs to be addressed?

@kwen2501

Motivation: Generalize unit tests so that they can be executed on CUDA and non-CUDA devices. Dependency: #133209, now merged. There was an earlier PR, #135242, for these changes; it was closed due to incorrect commits. I have incorporated the changes suggested in the comments. @kwen2501 @zeshengzong Please review the changes. Pull Request resolved: #139184 Approved by: https://github.com/kwen2501 Co-authored-by: Yu, Guangye <[email protected]>


pytorchbot added the open source label Aug 12, 2024

albanD requested a review from awgu August 12, 2024 18:24

albanD added the oncall: distributed and triaged labels Aug 12, 2024

ankurneog force-pushed the fsdp_common_changes branch from 4b7e235 to c12afd5 Compare August 15, 2024 07:10

pytorch-bot bot added the release notes: distributed (fsdp) label Aug 15, 2024

rahulsingh-intel mentioned this pull request Sep 5, 2024

Tests Generelization for multiple accelerator devices #135242

Closed

wconstab reviewed Sep 6, 2024


wconstab reviewed Sep 6, 2024


ankurneog force-pushed the fsdp_common_changes branch from c12afd5 to a8d9bee Compare September 9, 2024 04:43

ankurneog changed the title ~~Generalization of FSDP common for non-cuda UT execution~~ Generalization of FSDP common for non-cuda execution Sep 9, 2024

fegin added the ciflow/trunk and ciflow/periodic labels Sep 17, 2024

ankurneog force-pushed the fsdp_common_changes branch from a8d9bee to d94ed2e Compare September 18, 2024 02:31

pytorchmergebot force-pushed the fsdp_common_changes branch from d94ed2e to ab54f83 Compare September 18, 2024 05:25

ankurneog force-pushed the fsdp_common_changes branch 2 times, most recently from d05c785 to ee5db28 Compare September 18, 2024 08:33

pytorchmergebot removed the merging label Sep 24, 2024

pytorchmergebot force-pushed the fsdp_common_changes branch from 96817f5 to 7ec516d Compare September 25, 2024 07:32

Generalization of FSDP common for non-cuda UT execution

55cdd62

pytorchmergebot force-pushed the fsdp_common_changes branch from 7ec516d to 55cdd62 Compare September 26, 2024 10:45

pytorchmergebot added the merging label Sep 27, 2024

pytorchmergebot added the Merged label Sep 27, 2024

pytorchmergebot closed this in 22a4129 Sep 27, 2024

pytorchmergebot removed the merging label Sep 27, 2024

awgu reviewed Sep 27, 2024


ankurneog deleted the fsdp_common_changes branch September 27, 2024 02:54

rahulsingh-intel mentioned this pull request Oct 29, 2024

Tests Generelization for multiple accelerator devices #139184

Closed

ankurneog mentioned this pull request Mar 5, 2025

[RFC] Generalize pytorch content for non-native device execution pytorch/rfcs#66

Open

harikodali mentioned this pull request Jul 23, 2025

[RFC] Refactor test_c10d_nccl.py to Support Backend-Agnostic Testing for PyTorch #158911

Open
