
Conversation

@mori360
Contributor

@mori360 mori360 commented Nov 22, 2024

Add a 3D (pp+tp+fsdp) test, test_3d_with_tp_dp_pp, in test_pp_composability.
The test is currently parametrized (@parametrize) over:
"ScheduleClass" for pp: [ScheduleGPipe, Schedule1F1B, ScheduleInterleaved1F1B, ScheduleLoopedBFS, ScheduleInterleavedZeroBubble]
"MixedPrecisionParam" for fsdp: [torch.bfloat16, torch.float32]
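A rough sketch of the resulting test matrix (the generated test names below are illustrative only; the actual test uses PyTorch's @parametrize decorator from torch.testing._internal.common_utils, which formats names differently):

```python
from itertools import product

# Parameter values from the PR description; the name format below is
# an illustration, not the exact name PyTorch generates.
schedule_classes = [
    "ScheduleGPipe", "Schedule1F1B", "ScheduleInterleaved1F1B",
    "ScheduleLoopedBFS", "ScheduleInterleavedZeroBubble",
]
mixed_precision_params = ["bfloat16", "float32"]

# The cross product is the set of test instances generated.
test_instances = [
    f"test_3d_with_tp_dp_pp_{s}_{m}"
    for s, m in product(schedule_classes, mixed_precision_params)
]
print(len(test_instances))  # 10 parametrized test cases
```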

Future work:

  1. add fp8
  2. add cp (context parallelism) to enable a 4D test
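For context, the 3D decomposition such a test relies on can be sketched as follows (the 2x2x2 degrees and the row-major rank layout are assumptions for illustration, not taken from the PR; the real test would build its mesh with torch.distributed.device_mesh.init_device_mesh):

```python
# Hypothetical decomposition of an 8-rank world into a 3D mesh
# (pp x dp x tp). Degrees are illustrative assumptions.
world_size = 8
pp_degree, dp_degree, tp_degree = 2, 2, 2

assert pp_degree * dp_degree * tp_degree == world_size

def mesh_coords(rank: int) -> tuple[int, int, int]:
    # Rank -> (pp, dp, tp) coordinates for a row-major mesh layout.
    pp = rank // (dp_degree * tp_degree)
    dp = (rank // tp_degree) % dp_degree
    tp = rank % tp_degree
    return pp, dp, tp

print(mesh_coords(5))  # (1, 0, 1)
```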

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@pytorch-bot

pytorch-bot bot commented Nov 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141398

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit bc5ef3d with merge base d99c9c2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and topic: not user facing labels Nov 22, 2024
@mori360 mori360 requested a review from wconstab November 27, 2024 18:30
@mori360 mori360 marked this pull request as ready for review November 27, 2024 18:31
@mori360 mori360 requested a review from a team as a code owner November 27, 2024 18:31
Contributor

@wconstab wconstab left a comment


Thanks for working on this. There are a few issues that need to be addressed, with comments inline. The main themes are:

  • parallelisms aren't applied correctly, since the test uses MLPModel but the helpers expect the llama model
  • the organization can be improved: combine with other composability tests and reduce copied helpers

@mori360 mori360 marked this pull request as draft November 27, 2024 20:52
@mori360 mori360 marked this pull request as ready for review December 4, 2024 19:23
@mori360 mori360 requested a review from wconstab December 4, 2024 19:23
Contributor


note: not in this PR, but we might want to clean this pp init stuff up later. We should get rid of the 'single' vs 'multi' base classes and make it easier to construct a schedule without so many lines of boilerplate. Or maybe I'm wrong and single vs multi legitimately warrants different init flows? cc @H-Huang

Contributor

@wconstab wconstab left a comment


Overall LGTM. Thanks for all the improvements! Can you also check with @awgu and @kwen2501 for any more combinations of pp/fsdp that are important to include in the testing?

One follow-up, I think, is to add fp8. It can probably be an extension of this test function with one more parametrization, in another PR.

@mori360
Contributor Author

mori360 commented Dec 5, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased 3dtest onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout 3dtest && git pull --rebase)

Collaborator

@kwen2501 kwen2501 left a comment


LGTM.


# create "entire model"
total_layers = 8
dim = 8
Collaborator


nit: consider using dim in your model definition?
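The suggestion might look like the following shape bookkeeping (a sketch; the 2 * dim multiplier for the last layer is an assumption mirroring the nn.Linear(8, 16) in the test model):

```python
# Sketch of the reviewer's suggestion: derive the layer sizes from
# `dim` instead of hard-coding 8. The 2 * dim on the last layer is
# an assumption mirroring nn.Linear(8, 16) in the quoted model.
dim = 8
layer_shapes = [(dim, dim), (dim, dim), (dim, 2 * dim)]
print(layer_shapes)  # [(8, 8), (8, 8), (8, 16)]
```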

import torch.nn as nn

class MLPModel(nn.Module):  # the test model the review refers to
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(8, 8)
        self.net2 = nn.Linear(8, 8)
        self.net3 = nn.Linear(8, 16)
Collaborator


To answer your question: here you have to use 16 because of the colwise parallelism you apply to net3.
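The shape constraint behind that answer can be sketched as follows (the tp degree of 2 is an assumption, not taken from the PR): ColwiseParallel shards a Linear layer's out_features across tp ranks, so out_features must be divisible by the tp degree.

```python
# Illustrative shape math for ColwiseParallel on net3 = nn.Linear(8, 16).
# tp_degree = 2 is an assumed value for illustration.
tp_degree = 2
out_features = 16

# ColwiseParallel splits the output dimension across tp ranks,
# so it must divide evenly.
assert out_features % tp_degree == 0
local_out = out_features // tp_degree
print(local_out)  # each tp rank holds an (8 -> 8) local shard
```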

@mori360
Contributor Author

mori360 commented Dec 5, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Dec 5, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@mori360 mori360 removed the module: cpu, release notes: quantization, ciflow/mps, and module: dynamo labels Dec 11, 2024
@facebook-github-bot
Contributor

@mori360 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mori360 mori360 marked this pull request as ready for review December 12, 2024 03:26
@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

Details for Dev Infra team Raised by workflow job

@clee2000
Contributor

@pytorchbot merge -f "internal diff landed"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. That allows currently pending tests to finish and report signal before the merge.


@mori360 mori360 deleted the 3dtest branch February 6, 2025 01:11

Labels

  • ci-no-td (Do not run TD on this PR)
  • ciflow/inductor
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • Merged
  • module: inductor
  • oncall: distributed (Add this issue/PR to distributed oncall triage queue)
  • Reverted
  • topic: not user facing (topic category)


7 participants