E2E composability testing #141398

mori360 · 2024-11-22T23:22:54Z

Add a 3D (pp+tp+fsdp) test, test_3d_with_tp_dp_pp, to test_pp_composability.
Currently @parametrize is provided on:
"ScheduleClass" for pp in [ScheduleGPipe, Schedule1F1B, ScheduleInterleaved1F1B, ScheduleLoopedBFS, ScheduleInterleavedZeroBubble]
"MixedPrecisionParam" for fsdp in [torch.bfloat16, torch.float32]
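
As a rough illustration of the coverage this gives, the sketch below enumerates the combinations, assuming the two @parametrize axes expand to their cross product. The schedule and dtype names are plain strings copied from the lists above so that the snippet runs without PyTorch; the helper name enumerate_cases is hypothetical and is not part of the PR.

```python
from itertools import product

# Names copied from the PR description; kept as strings so no torch import is needed.
SCHEDULE_CLASSES = [
    "ScheduleGPipe",
    "Schedule1F1B",
    "ScheduleInterleaved1F1B",
    "ScheduleLoopedBFS",
    "ScheduleInterleavedZeroBubble",
]
MIXED_PRECISION_PARAMS = ["torch.bfloat16", "torch.float32"]


def enumerate_cases(schedules, dtypes):
    """Return every (schedule, dtype) pair, i.e. the cross product of both axes."""
    return list(product(schedules, dtypes))


cases = enumerate_cases(SCHEDULE_CLASSES, MIXED_PRECISION_PARAMS)
print(len(cases))  # 5 schedules x 2 dtypes = 10
```

Under that assumption, each additional axis (for example fp8) multiplies the number of generated test cases.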

Future work:

add fp8
add cp (context parallelism) to enable a 4D test

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

pytorch-bot · 2024-11-22T23:22:58Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141398

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit bc5ef3d with merge base d99c9c2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab

Thanks for working on this. There are a few issues that need to be addressed, with comments inline. The main themes are:

parallelisms aren't applied correctly, since the test uses MLPModel but the helpers expect a llama model
the organization can be improved: combine with other composability tests, reduce copied helpers
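
To make the first theme concrete, here is a minimal sketch of a check that would surface this kind of mismatch between a parallelize plan and the model it is applied to. All names are hypothetical: the plan keys stand in for llama-style submodule names, the module names stand in for an MLP's layers, and unmatched_plan_keys is an illustrative helper, not a PyTorch API.

```python
def unmatched_plan_keys(plan, module_names):
    """Return the plan keys that do not name any submodule of the model."""
    return sorted(key for key in plan if key not in module_names)


# Hypothetical submodule names of a small MLP-style model.
mlp_module_names = {"net1", "net2", "net3"}

# Hypothetical plan written for a llama-style model (names are assumptions).
llama_style_plan = {
    "attention.wq": "colwise",
    "attention.wo": "rowwise",
}

missing = unmatched_plan_keys(llama_style_plan, mlp_module_names)
print(missing)  # ['attention.wo', 'attention.wq']
```

If every key in the plan is unmatched, none of the intended sharding styles reach the model, which is one way a test can pass without exercising the parallelism it claims to cover.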

test/distributed/_composable/test_composability/test_3d_composability.py

.ci/pytorch/multigpu-test.sh

test/distributed/_composable/test_composability/test_3d_composability.py

test/distributed/_composable/test_composability/test_pp_composability.py

wconstab · 2024-12-04T19:44:48Z

test/distributed/_composable/test_composability/test_pp_composability.py

note: not in this PR, but we might want to clean this pp init stuff up later. We should get rid of the 'single' vs 'multi' base classes and make it easier to construct a schedule without so many lines of boilerplate. Or maybe I'm wrong and single vs multi is legitimately worth having different init flows for? cc @H-Huang

wconstab

Overall LGTM. Thanks for all the improvements! Can you also check with @awgu and @kwen2501 for any more combinations of pp/fsdp that are important to include in the testing?

One follow-up, I think, is to add fp8. It can probably be an extension of this test function with one more parameterization, in another PR.

mori360 · 2024-12-05T01:03:42Z

@pytorchbot rebase

pytorchmergebot · 2024-12-05T01:05:11Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-12-05T01:05:15Z

Successfully rebased 3dtest onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout 3dtest && git pull --rebase)

kwen2501

LGTM.

kwen2501 · 2024-12-05T18:19:42Z

test/distributed/_composable/test_composability/test_pp_composability.py

+
+        # create "entire model"
+        total_layers = 8
+        dim = 8


nit: consider using dim in your model definition?

kwen2501 · 2024-12-05T18:23:41Z

test/distributed/_composable/test_composability/test_pp_composability.py

+        super().__init__()
+        self.net1 = nn.Linear(8, 8)
+        self.net2 = nn.Linear(8, 8)
+        self.net3 = nn.Linear(8, 16)


To answer your q, here you have to use 16 because of the colwise you apply to net3.
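
For readers unfamiliar with column-wise sharding, the sketch below shows the shape arithmetic only. It assumes a TP degree of 2 and an even split of the output dimension; both are assumptions made for illustration, and the helper colwise_shard_shapes is hypothetical rather than a PyTorch function. It does not claim to reproduce the exact constraint in the test.

```python
def colwise_shard_shapes(in_features, out_features, tp_size):
    """Per-rank weight shapes when a linear layer's output dimension is split evenly.

    The weight is laid out as (out_features, in_features), matching nn.Linear,
    so a column-wise split partitions the first dimension of the weight.
    """
    if out_features % tp_size != 0:
        # Simplification of this sketch: only even splits are modeled.
        raise ValueError("out_features must be divisible by tp_size in this sketch")
    per_rank = out_features // tp_size
    return [(per_rank, in_features) for _ in range(tp_size)]


# net3 = nn.Linear(8, 16) from the snippet above, with an assumed TP degree of 2.
shapes = colwise_shard_shapes(in_features=8, out_features=16, tp_size=2)
print(shapes)  # [(8, 8), (8, 8)]
```

With these assumed numbers, each rank's local output from net3 has width 8, the same width the other layers use.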

mori360 · 2024-12-05T20:48:45Z

@pytorchbot merge

pytorchmergebot · 2024-12-05T20:50:27Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

facebook-github-bot · 2024-12-11T19:46:55Z

@mori360 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-12-12T03:27:16Z

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorchmergebot · 2024-12-12T03:28:54Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2024-12-12T03:29:13Z

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

Details for Dev Infra team

Raised by workflow job

clee2000 · 2024-12-12T04:17:43Z

@pytorchbot merge -f "internal diff landed"

pytorchmergebot · 2024-12-12T04:19:20Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

pytorch-bot bot added the oncall: distributed and topic: not user facing labels Nov 22, 2024

mori360 force-pushed the 3dtest branch from ec4b0ec to f88f500 November 22, 2024 23:31

mori360 requested a review from wconstab November 27, 2024 18:30

mori360 marked this pull request as ready for review November 27, 2024 18:31

mori360 requested a review from a team as a code owner November 27, 2024 18:31

wconstab requested changes Nov 27, 2024

mori360 marked this pull request as draft November 27, 2024 20:52

mori360 marked this pull request as ready for review December 4, 2024 19:23

mori360 requested a review from wconstab December 4, 2024 19:23

wconstab reviewed Dec 4, 2024

test/distributed/_composable/test_composability/test_pp_composability.py (Outdated)

wconstab reviewed Dec 4, 2024

test/distributed/_composable/test_composability/test_pp_composability.py (Outdated)

wconstab reviewed Dec 4, 2024

wconstab approved these changes Dec 4, 2024

pytorchmergebot force-pushed the 3dtest branch from 8931de1 to cd946ad December 5, 2024 01:05

mori360 marked this pull request as draft December 5, 2024 18:18

kwen2501 approved these changes Dec 5, 2024

pytorch-bot bot added the ciflow/trunk label Dec 5, 2024

pytorchmergebot added the merging label Dec 5, 2024

pytorchmergebot added the Merged label Dec 6, 2024

pytorchmergebot closed this in ad93aa8 Dec 6, 2024

pytorchmergebot removed the merging label Dec 6, 2024

pytorch deleted a comment from pytorch-bot bot Dec 6, 2024

mori360 added 10 commits December 10, 2024 20:10

add into ci test (5974a0d)
add requirements (2efd24d)
correct requirements (14f043b)
lint error (0be86f6)
change test setting (71482a3)
lintrunner (1a73ddb)
revise 3d test (1eabea0)
move 3d test to pp test (171e7e6)
remove ref_model (996ecdf)
revert change at test_manual_with_data_parallel (bc5ef3d)

mori360 force-pushed the 3dtest branch from d06c6d0 to bc5ef3d December 11, 2024 04:22

mori360 removed the module: cpu, release notes: quantization, ciflow/mps, and module: dynamo labels Dec 11, 2024

mori360 marked this pull request as ready for review December 12, 2024 03:26

pytorchmergebot added the merging label Dec 12, 2024

pytorchmergebot removed the merging label Dec 12, 2024

pytorchmergebot added the merging label Dec 12, 2024

pytorchmergebot closed this in 4d07754 Dec 12, 2024

pytorchmergebot removed the merging label Dec 12, 2024

mori360 deleted the 3dtest branch February 6, 2025 01:11

7 participants
