Skip to content

Conversation

@ankurneog
Copy link

@ankurneog ankurneog commented Aug 12, 2024

Motivation

The FSDP common code for FSDP UT execution is mostly written with cuda device in mind. However other devices such the intel Gaudi supports most of the functionality. We are generalizing the base content so that the UT content can be used for non-cuda device execution.

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot
Copy link

pytorch-bot bot commented Aug 12, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133209

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 55cdd62 with merge base 76b044d (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@albanD albanD requested a review from awgu August 12, 2024 18:24
@albanD albanD added oncall: distributed Add this issue/PR to distributed oncall triage queue triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels Aug 12, 2024
@ankurneog
Copy link
Author

@wconstab , @albanD : can you please help with the review . Thanks

@ankurneog ankurneog force-pushed the fsdp_common_changes branch from 4b7e235 to c12afd5 Compare August 15, 2024 07:10
@pytorch-bot pytorch-bot bot added the release notes: distributed (fsdp) release notes category label Aug 15, 2024
@albanD
Copy link
Collaborator

albanD commented Aug 15, 2024

@wconstab who would have some bandwidth to do these FSDP update to make it work with non-cuda devices?

@ankurneog
Copy link
Author

ankurneog commented Aug 20, 2024

@wconstab : Can you please help with the review and approval, Main changes here : https://github.com/pytorch/pytorch/pull/133209/files#diff-7b5c66f99161fa6a3d9042e80f8c8cc140a64e43445feede46f55e53154f6c3d (torch/testing/_internal/common_fsdp.py), others are mostly for adapting to change made in that file

@ankurneog
Copy link
Author

ankurneog commented Sep 6, 2024

@albanD /@wconstab : could you please help with the review and approval. Note that we have verified this for both intel gaudi and cuda. the corresponding test module changes are included here : #135242
thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, why the difference?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, please help us unify code paths to reduce maintenance load.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ankurneog this

Copy link
Contributor

@wconstab wconstab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly LGTM. a couple of small questions and then i would stamp.

Also, fwiw you may want to look into FSDP2 support as well, if you haven't.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo- is this actually unchanged in the new formula?

as long as TEST_CUDA := torch.cuda.is_available(), then DEVICE_COUNT := torch.cuda.device_count() so this appears equivalent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a behavior change? were we relying on the 'set_device' for cuda to cover this before and now we make it explicit?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ankurneog this

@ankurneog ankurneog changed the title Generalization of FSDP common for non-cuda UT execution Generalization of FSDP common for non-cuda execution Sep 9, 2024
@ankurneog
Copy link
Author

ankurneog commented Sep 9, 2024

@wconstab : thank you for your review comments, I have addressed them and also replied to your queries. could you please have a look and help with the approval. thank you.

@ankurneog
Copy link
Author

@wconstab : gentle reminder , could you kindly do the needful, thanks.

@ankurneog
Copy link
Author

@wconstab / @fegin : could you please help with the approval , we have another PR: #135242 dependent on this. Thank you

@kwen2501
Copy link
Collaborator

Nice effort.
Curious -- does anyone know the status of FSDP's device support now? @awgu
Or is the plan to first improve the tests then the library? @ankurneog

@fegin fegin added ciflow/trunk Trigger trunk jobs on your pull request ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR labels Sep 17, 2024
@ankurneog
Copy link
Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased fsdp_common_changes onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fsdp_common_changes && git pull --rebase)

@ankurneog ankurneog force-pushed the fsdp_common_changes branch 2 times, most recently from d05c785 to ee5db28 Compare September 18, 2024 08:33
@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@ankurneog
Copy link
Author

@albanD , @kwen2501 : Could you please help with the merge . Thank you

@ankurneog
Copy link
Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased fsdp_common_changes onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fsdp_common_changes && git pull --rebase)

@ankurneog
Copy link
Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased fsdp_common_changes onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fsdp_common_changes && git pull --rebase)

@ankurneog
Copy link
Author

@albanD , @kwen2501 : Could you please help with the merge . The failures look unrelated.

@ankurneog
Copy link
Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

if TEST_HPU:
time.sleep(self.delay_before_reduction_ms / 1000)
elif TEST_CUDA:
torch.cuda._sleep(int(self.delay_after_loss_ms * get_cycles_per_ms()))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is HPU sleeping based on delay_before_reduction_ms instead of delay_after_loss_ms?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one I just added is new

@awgu
Copy link
Collaborator

awgu commented Sep 27, 2024

I am kind of confused why some of the comments left by previous reviewers were left unaddressed or without response.

@ankurneog ankurneog deleted the fsdp_common_changes branch September 27, 2024 02:54
@ankurneog
Copy link
Author

I am kind of confused why some of the comments left by previous reviewers were left unaddressed or without response.

Could you please let me know what needs to be addressed?

pytorchmergebot pushed a commit that referenced this pull request Dec 11, 2024
Motivation: Generalize unit tests so that can be executed for cuda and non cuda devices.
Depedency : #133209  Merged now.
There was a #135242  for these changes and closed due to in correct commits. I have incoroprated the changes as suggested in comments.
@kwen2501  @zeshengzong Please review the changes.

Pull Request resolved: #139184
Approved by: https://github.com/kwen2501

Co-authored-by: Yu, Guangye <[email protected]>
pytorchmergebot pushed a commit that referenced this pull request Jan 7, 2025
Motivation: Generalize unit tests so that can be executed for cuda and non cuda devices.
Depedency : #133209  Merged now.
There was a #135242  for these changes and closed due to in correct commits. I have incoroprated the changes as suggested in comments.
@kwen2501  @zeshengzong Please review the changes.

Pull Request resolved: #139184
Approved by: https://github.com/kwen2501

Co-authored-by: Yu, Guangye <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (fsdp) release notes category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants