[Pipelining] add schedule simulator and chrometrace dump #138134
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138134
Note: Links to docs will display an error until the doc builds have completed. ✅ No failures as of commit c1e5289 (failed to retrieve the merge base; please contact dev infra). This comment was automatically generated by Dr. CI and updates every 15 minutes.
# _dump_chrometrace(simulated_schedule, "lowered_comms.json")
# print(_format_pipeline_order(simulated_schedule))
num_steps = max([len(simulated_schedule[rank]) for rank in simulated_schedule])
self.assertEqual(num_steps, 9)
how come there are 9 steps when the original schedule looks like it has 8 steps?
probably there was one bubble added, because of waiting for the first recv op or something.
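For intuition, here is a hypothetical two-rank fragment (illustrative action labels, not the schedule used in this test) showing how a single bubble pushes the simulated step count past the number of actions on a rank:

# Hypothetical sketch: rank 1 cannot run anything at timestep 0 because its
# first action is a recv that depends on rank 0's first send, so the simulator
# inserts a None bubble and rank 1's 8 actions span 9 simulated timesteps.
simulated_schedule = {
    0: ["F0", "SEND_F0", "F1", "SEND_F1", "RECV_B0", "B0", "RECV_B1", "B1"],
    1: [None, "RECV_F0", "F0", "B0", "SEND_B0", "RECV_F1", "F1", "B1", "SEND_B1"],
}
num_steps = max(len(simulated_schedule[rank]) for rank in simulated_schedule)
assert num_steps == 9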
| "pid": rank, | ||
| "tid": rank, | ||
| "ts": timestep, | ||
| "dur": 1, |
This is arbitrary, but should B have a different duration? Without the deferred weight update, the literature usually models backward as 2x the forward.
Yeah, B should be 2x forward. This is pretty barebones for now, but I will make that change. One key feature I didn't implement is adding 'flow events' (arrows in the profile visualizer) that connect one send to its matching recv, just to aid visual debugging.
Another thing I didn't implement, in either the visualizer or the simulator, is 'stream simulation'. I could probably do a better job of showing overlap of comms/compute, but I punted on that for now.
Actually, B is in a limbo state in this PR. In the next PR it becomes clearer that B means 'dInput' (which would probably be closer to 1), and 'W' means dW (which would also be closer to 1). The BW op is missing, which would be closer to 2. I will fix that in the next PR instead of this one.
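For reference, the Chrome trace event format that the dump targets looks roughly like the sketch below. The 2x backward duration and the flow-event pair tying a send to its matching recv are the follow-ups discussed above, written here as assumptions rather than what this PR currently emits; the resulting file opens in chrome://tracing or Perfetto.

import json

# Minimal sketch (assumptions, not the PR's current output): complete events
# ("ph": "X") per action, an assumed 2x duration for backward, and a flow
# arrow ("ph": "s"/"f" with a shared id) from a SEND on rank 0 to the RECV on
# rank 1.  Timestamps here are abstract timesteps, not microseconds.
trace = {
    "traceEvents": [
        {"name": "F0",      "ph": "X", "pid": 0, "tid": 0, "ts": 0, "dur": 1},
        {"name": "SEND_F0", "ph": "X", "pid": 0, "tid": 0, "ts": 1, "dur": 1},
        {"name": "RECV_F0", "ph": "X", "pid": 1, "tid": 1, "ts": 1, "dur": 1},
        {"name": "B0",      "ph": "X", "pid": 1, "tid": 1, "ts": 2, "dur": 2},
        # Flow events bind by (cat, name, id) and render as an arrow.
        {"name": "comm", "cat": "comm", "ph": "s", "id": 1, "pid": 0, "tid": 0, "ts": 1},
        {"name": "comm", "cat": "comm", "ph": "f", "bp": "e", "id": 1, "pid": 1, "tid": 1, "ts": 1},
    ]
}
with open("pipeline_schedule.json", "w") as f:
    json.dump(trace, f)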
return schedule_map[lowercase_keys[lowercase_schedule_name]]


def _simulate_comms_compute(
Comment on the function's purpose? Do I have the correct understanding that it is intended to be used after adding comm ops in _add_send_recv, to reorder the communication ops?
I'll add a comment.
No, the purpose is not to reorder ops. In fact it should never reorder ops, but it may add bubbles. The two purposes of the function are (1) to determine whether the schedule is expected to deadlock, and (2), assuming it runs without deadlock, to simulate the logical ordering of ops per timestep. I don't require users to put 'bubbles' into their schedule definitions, but the simulator adds the bubbles in based on whether there are timesteps where no action is 'ready' to execute on the local rank given its comm dependencies.
H-Huang left a comment:
Thanks for updating.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 1 job has failed; the first few are: linux-binary-libtorch-cxx11-abi / libtorch-cpu-shared-with-deps-cxx11-abi-build / build. Details for Dev Infra team: raised by workflow job.

@pytorchbot merge -i

Merge started. Your change will be merged while ignoring the following 0 checks.

PR description?

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.

@pytorchbot merge -i

Merge started. Your change will be merged while ignoring the following 3 checks: pull / linux-focal-cuda12.4-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, linux.4xlarge.nvidia.gpu), pull / linux-focal-cuda12.1-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, linux.4xlarge.nvidia.gpu), pull / linux-focal-py3.12-clang10 / test (default, 1, 4, linux.4xlarge).

Merge failed. Reason: 1 mandatory check(s) failed. The first few are: dig deeper by viewing the failures on hud.

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Pull Request resolved: pytorch#138134
Approved by: https://github.com/H-Huang
Stack from ghstack (oldest at bottom):
Schedule simulator is useful for detecting hangs in schedules and
validating that they won't hang. It also inserts bubbles (None actions)
at any timestep where a rank can not enqueue its next action due to
unmet dependencies, which can serve as a rough metric for schedule
efficiency. The output can be visualized. The simulator expects a full
comm + compute schedule as input.
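For example, a rough efficiency number can be read off the simulator output with plain Python; this sketch assumes the output maps rank -> list of actions with None marking bubbles, as described above.

def bubble_fraction(simulated_schedule):
    # Fraction of (rank, timestep) slots occupied by bubbles (None actions);
    # lower means a tighter schedule.
    slots = sum(len(actions) for actions in simulated_schedule.values())
    bubbles = sum(a is None for actions in simulated_schedule.values() for a in actions)
    return bubbles / slots if slots else 0.0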
Chrometrace dump is a basic visualization utility. It currently just
renders one 'process' per rank, and lets users visualize the schedule in
a UI instead of as text.
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @d4l3k @c-p-i-o