[Pipelining] add schedule simulator and chrometrace dump #138134
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138134
Note: Links to docs will display an error until the doc builds have completed. ✅ No failures as of commit c1e5289 (failed to retrieve the merge base; please contact dev infra). This comment was automatically generated by Dr. CI and updates every 15 minutes.
# _dump_chrometrace(simulated_schedule, "lowered_comms.json")
# print(_format_pipeline_order(simulated_schedule))
num_steps = max([len(simulated_schedule[rank]) for rank in simulated_schedule])
self.assertEqual(num_steps, 9)
how come there are 9 steps when the original schedule looks like it has 8 steps?
probably there was one bubble added, because of waiting for the first recv op or something.
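For intuition, here is a hypothetical two-rank fragment (illustrative action labels, not the schedule used in this test) showing how a single bubble pushes the simulated step count past the number of actions on a rank:

# Hypothetical sketch: rank 1 cannot run anything at timestep 0 because its
# first action is a recv that depends on rank 0's first send, so the simulator
# inserts a None bubble and rank 1's 8 actions span 9 simulated timesteps.
simulated_schedule = {
    0: ["F0", "SEND_F0", "F1", "SEND_F1", "RECV_B0", "B0", "RECV_B1", "B1"],
    1: [None, "RECV_F0", "F0", "B0", "SEND_B0", "RECV_F1", "F1", "B1", "SEND_B1"],
}
num_steps = max(len(simulated_schedule[rank]) for rank in simulated_schedule)
assert num_steps == 9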
| "pid": rank, | ||
| "tid": rank, | ||
| "ts": timestep, | ||
| "dur": 1, |
This is arbitrary, but should B have a different duration? Without the deferred weight update, the literature usually models backward as 2x the forward.
Yeah, B should be 2x forward. This is pretty barebones for now, but I will make that change. One key feature I didn't implement is adding 'flow events' (arrows in the profile visualizer) that connect one send to its matching recv, just to aid visual debugging.
Another thing I didn't implement, in either the visualizer or the simulator, is 'stream simulation'. I could probably do a better job of showing overlap of comms/compute, but I punted on that for now.
Actually, B is in a limbo state in this PR. In the next PR it becomes clearer that B means 'dInput' (which would probably be closer to 1), and 'W' means dW (which would also be closer to 1). The BW op is missing, which would be closer to 2. I will fix that in the next PR instead of this one.
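For reference, the Chrome trace event format that the dump targets looks roughly like the sketch below. The 2x backward duration and the flow-event pair tying a send to its matching recv are the follow-ups discussed above, written here as assumptions rather than what this PR currently emits; the resulting file opens in chrome://tracing or Perfetto.

import json

# Minimal sketch (assumptions, not the PR's current output): complete events
# ("ph": "X") per action, an assumed 2x duration for backward, and a flow
# arrow ("ph": "s"/"f" with a shared id) from a SEND on rank 0 to the RECV on
# rank 1.  Timestamps here are abstract timesteps, not microseconds.
trace = {
    "traceEvents": [
        {"name": "F0",      "ph": "X", "pid": 0, "tid": 0, "ts": 0, "dur": 1},
        {"name": "SEND_F0", "ph": "X", "pid": 0, "tid": 0, "ts": 1, "dur": 1},
        {"name": "RECV_F0", "ph": "X", "pid": 1, "tid": 1, "ts": 1, "dur": 1},
        {"name": "B0",      "ph": "X", "pid": 1, "tid": 1, "ts": 2, "dur": 2},
        # Flow events bind by (cat, name, id) and render as an arrow.
        {"name": "comm", "cat": "comm", "ph": "s", "id": 1, "pid": 0, "tid": 0, "ts": 1},
        {"name": "comm", "cat": "comm", "ph": "f", "bp": "e", "id": 1, "pid": 1, "tid": 1, "ts": 1},
    ]
}
with open("pipeline_schedule.json", "w") as f:
    json.dump(trace, f)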
return schedule_map[lowercase_keys[lowercase_schedule_name]]


def _simulate_comms_compute(
Comment on the function's purpose? Do I have the correct understanding that it is intended to be used after adding comm ops in _add_send_recv, to reorder the communication ops?
I'll add a comment.
No, the purpose is not to reorder ops. In fact it should never reorder ops, but it may add bubbles. The two purposes of the function are (1) to determine whether the schedule is expected to deadlock, and (2), assuming it runs without deadlock, to simulate the logical ordering of ops per timestep. I don't require users to put 'bubbles' into their schedule definitions, but the simulator adds the bubbles in based on whether there are timesteps where no action is 'ready' to execute on the local rank given its comm dependencies.
H-Huang left a comment:
Thanks for updating.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 1 job has failed; the first few are: linux-binary-libtorch-cxx11-abi / libtorch-cpu-shared-with-deps-cxx11-abi-build / build. Details for Dev Infra team: raised by workflow job.

@pytorchbot merge -i

Merge started. Your change will be merged while ignoring the following 0 checks.

PR description?

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.

@pytorchbot merge -i

Merge started. Your change will be merged while ignoring the following 3 checks: pull / linux-focal-cuda12.4-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, linux.4xlarge.nvidia.gpu), pull / linux-focal-cuda12.1-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, linux.4xlarge.nvidia.gpu), pull / linux-focal-py3.12-clang10 / test (default, 1, 4, linux.4xlarge).

Merge failed. Reason: 1 mandatory check(s) failed. The first few are: dig deeper by viewing the failures on hud.

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Pull Request resolved: pytorch#138134
Approved by: https://github.com/H-Huang
Stack from ghstack (oldest at bottom):
Schedule simulator is useful for detecting hangs in schedules and
validating that they won't hang. It also inserts bubbles (None actions)
at any timestep where a rank can not enqueue its next action due to
unmet dependencies, which can serve as a rough metric for schedule
efficiency. The output can be visualized. The simulator expects a full
comm + compute schedule as input.
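For example, a rough efficiency number can be read off the simulator output with plain Python; this sketch assumes the output maps rank -> list of actions with None marking bubbles, as described above.

def bubble_fraction(simulated_schedule):
    # Fraction of (rank, timestep) slots occupied by bubbles (None actions);
    # lower means a tighter schedule.
    slots = sum(len(actions) for actions in simulated_schedule.values())
    bubbles = sum(a is None for actions in simulated_schedule.values() for a in actions)
    return bubbles / slots if slots else 0.0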
Chrometrace dump is a basic visualization utility. It currently just
renders one 'process' per rank, and lets users visualize the schedule in
a UI instead of as text.
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @d4l3k @c-p-i-o