Skip to content

Conversation

@awgu
Copy link
Collaborator

@awgu awgu commented Dec 9, 2024

Stack from ghstack (oldest at bottom):

Internal only: the before way meant that from torch.distributed._composable.fsdp import fully_shard was importing fully_shard.py not the function fully_shard. For some reason, the resolution order is different from open source.

To fix this, we match the old import as closely as possible. Namely, we import fully_shard.py contents from .fully_shard. This should force that import to take precedence.

cc @H-Huang @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@diff-train-skip-merge

Differential Revision: D66990327

@pytorch-bot
Copy link

pytorch-bot bot commented Dec 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142419

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ You can merge normally! (3 Unrelated Failures)

As of commit 5aabc65 with merge base beeffe7 (image):

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (fsdp) release notes category labels Dec 9, 2024
@awgu awgu added release notes: distributed (fsdp2) release notes category ciflow/trunk Trigger trunk jobs on your pull request and removed release notes: distributed (fsdp) release notes category labels Dec 9, 2024
cc H-Huang kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
awgu pushed a commit that referenced this pull request Dec 9, 2024
ghstack-source-id: 2c4a7af
Pull Request resolved: #142419
@awgu
Copy link
Collaborator Author

awgu commented Dec 9, 2024

@awgu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -1,8 +1,3 @@
from torch.distributed.fsdp import (
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Internal only: the before way meant that from torch.distributed._composable.fsdp import fully_shard was importing fully_shard.py not the function fully_shard. For some reason, the resolution order is different from open source.

To fix this, we match the old import as closely as possible. Namely, we import fully_shard.py contents from .fully_shard. This should force that import to take precedence.

@awgu
Copy link
Collaborator Author

awgu commented Dec 9, 2024

Since all distributed unit tests in pytorch repo are migrated to use the new import, this should be a safe change for the pytorch unit tests.

I checked that

pytest test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_old

passes locally still.

I will have to ninja land this from internal to ensure internal gets this fix asap. I will do so after I verify that lint finishes.

@awgu awgu marked this pull request as ready for review December 9, 2024 23:27
@awgu
Copy link
Collaborator Author

awgu commented Dec 9, 2024

Both lintrunner jobs (clang, noclang) have passed. I am going to land.

@facebook-github-bot
Copy link
Contributor

@pytorchbot merge -i

(Initiating merge automatically since Phabricator Diff has merged, merging with -i because oss signals were bypassed internally)

4 similar comments
@facebook-github-bot
Copy link
Contributor

@pytorchbot merge -i

(Initiating merge automatically since Phabricator Diff has merged, merging with -i because oss signals were bypassed internally)

@facebook-github-bot
Copy link
Contributor

@pytorchbot merge -i

(Initiating merge automatically since Phabricator Diff has merged, merging with -i because oss signals were bypassed internally)

@facebook-github-bot
Copy link
Contributor

@pytorchbot merge -i

(Initiating merge automatically since Phabricator Diff has merged, merging with -i because oss signals were bypassed internally)

@facebook-github-bot
Copy link
Contributor

@pytorchbot merge -i

(Initiating merge automatically since Phabricator Diff has merged, merging with -i because oss signals were bypassed internally)

@clee2000
Copy link
Contributor

clee2000 commented Dec 9, 2024

@pytorchbot merge -f "fb bot is spamming again, landed internally"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@github-actions github-actions bot deleted the gh/awgu/662/head branch January 11, 2025 02:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (fsdp2) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants