[DCP][OSS] Rank local checkpointing in DCP without collectives #147758
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147758. Note: links to docs will display an error until the docs builds have completed.
✅ No failures as of commit 147fb0e with merge base 8eee08d. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
This pull request was exported from Phabricator. Differential Revision: D70112642
Summary: X-link: pytorch/pytorch#147758. Context: DCP metadata collectives become prohibitively expensive as the job scale grows. This PR introduces rank-local checkpointing, which saves and loads the checkpoint without any collective. The current trade-off is that dedupe and re-sharding are not supported; support for these will be added soon. Reviewed By: meetv18. Differential Revision: D70112642
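The idea behind rank-local checkpointing can be illustrated with a torch-free sketch (all names and the file layout below are hypothetical illustrations, not the PR's actual API): each rank writes only its own shard plus a rank-local metadata file, so no all-gather or broadcast is needed to assemble a global metadata view. Because no rank ever sees the global layout, loading requires the same rank to read back exactly the shard it wrote, which is why re-sharding is not supported yet.

```python
import json
import os
import pickle
import tempfile


def save_rank_local(state_dict, ckpt_dir, rank):
    # Hypothetical sketch: each rank serializes its own shard and its own
    # metadata file independently; no collective is involved.
    os.makedirs(ckpt_dir, exist_ok=True)
    data_path = os.path.join(ckpt_dir, f"__{rank}_0.distcp")
    with open(data_path, "wb") as f:
        pickle.dump(state_dict, f)
    meta = {"rank": rank, "keys": sorted(state_dict)}
    with open(os.path.join(ckpt_dir, f".metadata.{rank}"), "w") as f:
        json.dump(meta, f)
    return data_path


def load_rank_local(ckpt_dir, rank):
    # The same rank reads back exactly the shard it wrote. Without a global
    # metadata view, loading on a different world size (re-sharding) or
    # deduplicating tensors shared across ranks is not possible.
    with open(os.path.join(ckpt_dir, f"__{rank}_0.distcp"), "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        save_rank_local({"w": [1.0, 2.0]}, d, rank=0)
        print(load_rank_local(d, rank=0))  # {'w': [1.0, 2.0]}
```

For the real feature, refer to the `torch.distributed.checkpoint` APIs changed in this PR; the sketch above only captures the save/load-without-collectives shape of the design.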
Summary: X-link: meta-pytorch/tnt#991. Context: DCP metadata collectives become prohibitively expensive as the job scale grows. This PR introduces rank-local checkpointing, which saves and loads the checkpoint without any collective. The current trade-off is that dedupe and re-sharding are not supported; support for these will be added soon. Test Plan: end-to-end unit tests. Save and load test with internal DCP components: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-textray-pretrain_mlm-lv5d7qcfmnqzkd. Save and load tests with OSS DCP components: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-textray-pretrain_mlm-z1vz46vkkgtcld and https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-textray-pretrain_mlm-njvvbn07rv5ckd. Reviewed By: meetv18. Differential Revision: D70112642
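The scaling motivation ("metadata collectives become prohibitively expensive as the job scale grows") can be made concrete with back-of-the-envelope arithmetic. The numbers below are illustrative, not measurements from this PR: in a gather-based save, the coordinator rank receives every rank's save plan/metadata, so its traffic and work grow linearly with world size, while a rank-local save keeps per-rank cost constant.

```python
def gathered_metadata_bytes(world_size: int, per_rank_metadata_bytes: int) -> int:
    # Collective-based save: the coordinator gathers metadata from all ranks,
    # so the bytes it must receive and process scale with world size.
    return world_size * per_rank_metadata_bytes


def rank_local_metadata_bytes(per_rank_metadata_bytes: int) -> int:
    # Rank-local save: each rank handles only its own metadata,
    # independent of how many ranks the job has.
    return per_rank_metadata_bytes


if __name__ == "__main__":
    one_mib = 1 << 20  # assume ~1 MiB of metadata per rank (illustrative)
    for ws in (8, 1024, 16384):
        print(ws, gathered_metadata_bytes(ws, one_mib), rank_local_metadata_bytes(one_mib))
```

At 16,384 ranks the coordinator in the gather-based model would ingest 16 GiB of metadata under this assumption, while the rank-local model stays at 1 MiB per rank, which is the asymmetry the PR is removing.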
@pytorchmergebot merge
Merge failed. Reason: This PR has internal changes and must be landed via Phabricator! Please try re-importing/re-exporting the PR. (Details for Dev Infra team: raised by workflow job.)
Summary: Pull Request resolved: #991. X-link: pytorch/pytorch#147758. Reviewed By: meetv18. Differential Revision: D70112642. fbshipit-source-id: 5558a1d2440e539f87a9b7b6295b4199fb4b448a
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary: DCP metadata collectives become prohibitively expensive as the job scale grows. This PR introduces rank-local checkpointing, which saves and loads the checkpoint without any collective. The current trade-off is that dedupe and re-sharding are not supported; support for these will be added soon. Differential Revision: D70112642. Pull Request resolved: #147758. Approved by: https://github.com/meetv18
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @LucasLLC @pradeepfn @kwen2501 @c-p-i-o @MeetVadakkanchery @mhorowitz @ekr0