Conversation

@wconstab (Contributor) commented Oct 16, 2024

Stack from ghstack (oldest at bottom):

The schedule simulator is useful for detecting hangs in schedules, i.e. validating that they won't hang at runtime. It also inserts bubbles (None actions) at any timestep where a rank cannot enqueue its next action due to unmet dependencies, which can serve as a rough metric for schedule efficiency. The output can be visualized. The simulator expects a full comm + compute schedule as input.
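
For illustration, a minimal sketch of the schedule shapes involved (hypothetical action names and ranks, not the real pipelining action objects; exact bubble placement depends on the simulator's dependency rules): the simulator takes a per-rank list of actions and returns a per-rank list in which None entries mark bubbles.

# Hypothetical 2-rank input: keys are ranks, values are ordered compute/comm actions.
input_schedule = {
    0: ["F0", "SEND_F0"],
    1: ["RECV_F0", "F0"],
}
# Possible simulated output: rank 1 cannot complete its recv until rank 0's send
# has been issued, so the simulator pads it with a None bubble at the first timestep.
simulated = {
    0: ["F0", "SEND_F0"],
    1: [None, "RECV_F0", "F0"],
}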

The chrometrace dump is a basic visualization utility. It currently renders just one 'process' per rank and lets users visualize the schedule in a UI instead of as text.
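
A rough sketch of what such a dump could look like (the helper below is a stand-in, not the PR's actual _dump_chrometrace implementation): each non-bubble action becomes a complete event ("ph": "X") with one pid per rank and the logical timestep as the timestamp, viewable in chrome://tracing or Perfetto.

import json

def dump_chrometrace_sketch(schedule, filename):
    # schedule: dict mapping rank -> list of actions (None entries are bubbles)
    events = []
    for rank, actions in schedule.items():
        for timestep, action in enumerate(actions):
            if action is None:
                continue  # skip bubbles; they show up as gaps in the trace
            events.append({
                "name": str(action),
                "ph": "X",       # complete event
                "pid": rank,     # one 'process' per rank
                "tid": rank,
                "ts": timestep,  # logical timestep, not wall-clock time
                "dur": 1,        # every action gets unit duration for now
            })
    with open(filename, "w") as f:
        json.dump({"traceEvents": events}, f)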

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @d4l3k @c-p-i-o

pytorch-bot bot commented Oct 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138134

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c1e5289 (merge base could not be retrieved; please contact dev infra):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the 'oncall: distributed' label Oct 16, 2024
@wconstab wconstab requested review from H-Huang and kwen2501 October 18, 2024 00:29
# _dump_chrometrace(simulated_schedule, "lowered_comms.json")
# print(_format_pipeline_order(simulated_schedule))
num_steps = max([len(simulated_schedule[rank]) for rank in simulated_schedule])
self.assertEqual(num_steps, 9)
Member

how come there are 9 steps when the original schedule looks like it has 8 steps?

Contributor Author

probably there was one bubble added, because of waiting for the first recv op or something.

"pid": rank,
"tid": rank,
"ts": timestep,
"dur": 1,
Member

This is arbitrary, but should B have a different duration? Without the deferred weight update, the literature usually models backward as 2x the forward.

Contributor Author

Yeah, B should be 2x forward. This is pretty barebones for now, but I will make that change. One key feature that I didn't implement is adding 'flow events' (arrows in the profile visualizer) that connect one send to the matching recv, just to aid visual debugging (see the sketch below).

Another thing I didn't implement in either the visualizer or the simulator is 'stream simulation'. I could probably do a better job of showing overlap of comms/compute, but I punted on that for now.
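
For reference, a rough sketch of what such flow events could look like in the Chrome trace format (hypothetical values; nothing like this is implemented in this PR): a flow start record on the sending rank is matched to a flow finish record on the receiving rank by a shared cat/id pair, and the viewer draws an arrow between the two enclosing slices.

# Hypothetical flow-event pair tying a SEND on rank 0 (ts=2) to the matching RECV on rank 1 (ts=3).
# "ph": "s" starts a flow, "ph": "f" finishes it; records are matched by "cat" + "id".
flow_id = 0
flow_start = {
    "name": "send_recv", "cat": "p2p", "ph": "s", "id": flow_id,
    "pid": 0, "tid": 0, "ts": 2,
}
flow_finish = {
    "name": "send_recv", "cat": "p2p", "ph": "f", "bp": "e", "id": flow_id,
    "pid": 1, "tid": 1, "ts": 3,
}
# Both records would be appended to the same traceEvents list as the slices they annotate.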

Contributor Author

Actually B is in a limbo state in this PR. In the next PR it becomes clearer that B means 'dInput' (which would probably be closer to 1), and 'W' means dW (which would also be closer to 1). The combined BW op is missing, which would be closer to 2. I will fix that in the next PR instead of this one.
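
For illustration, the kind of per-action duration table that change could use (made-up relative numbers following the roughly-2x-forward rule of thumb; not part of this PR):

# Illustrative relative durations only:
# F = forward, B = dInput, W = dWeight, BW = full (unsplit) backward.
ACTION_DURATIONS = {
    "F": 1,
    "B": 1,
    "W": 1,
    "BW": 2,
}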

return schedule_map[lowercase_keys[lowercase_schedule_name]]


def _simulate_comms_compute(
Member

Could you add a comment on the function's purpose? Do I have the correct understanding that it is intended to be used after adding comm ops in _add_send_recv, to reorder the communication ops?

Contributor Author

I'll add a comment.

No, the purpose is not to reorder ops. In fact it should never reorder ops, but it may add bubbles. The two purposes of the function are (1) to determine whether the schedule is expected to deadlock, and (2) assuming it runs without deadlock, to simulate the logical ordering of ops per timestep. I don't require users to put 'bubbles' into their schedule definitions; instead, the simulator adds a bubble wherever there is a timestep at which no action is 'ready' to execute on the local rank given its comm dependencies.
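
A simplified sketch of that logic, under the assumption that the schedule is a dict of per-rank action lists and that a user-supplied readiness check encodes the comm dependencies (hypothetical helper; the real _simulate_comms_compute handles the actual action types and dependency rules): each timestep, every rank either executes its next ready action or records a None bubble, and if no rank makes progress in a timestep the schedule is declared deadlocked.

def simulate_sketch(schedule, is_ready):
    # schedule: dict rank -> ordered list of actions (never reordered by the simulator)
    # is_ready(rank, action, executed): True once the action's comm dependencies are met
    remaining = {rank: list(actions) for rank, actions in schedule.items()}
    simulated = {rank: [] for rank in schedule}
    executed = set()
    while any(remaining.values()):
        progressed = False
        for rank, actions in remaining.items():
            if actions and is_ready(rank, actions[0], executed):
                action = actions.pop(0)
                simulated[rank].append(action)
                executed.add((rank, action))
                progressed = True
            elif actions:
                simulated[rank].append(None)  # bubble: waiting on an unmet dependency
        if not progressed:
            raise RuntimeError("Schedule would hang: no rank can make progress")
    return simulated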

@H-Huang H-Huang (Member) left a comment

thanks for updating

@wconstab (Contributor Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the 'ciflow/trunk' label Oct 21, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-binary-libtorch-cxx11-abi / libtorch-cpu-shared-with-deps-cxx11-abi-build / build

@wconstab (Contributor Author)

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following 0 checks:

@kwen2501 (Collaborator)

PR description?

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see the pytorch-bot wiki.

@wconstab (Contributor Author)

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following 3 checks: pull / linux-focal-cuda12.4-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, linux.4xlarge.nvidia.gpu), pull / linux-focal-cuda12.1-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, linux.4xlarge.nvidia.gpu), pull / linux-focal-py3.12-clang10 / test (default, 1, 4, linux.4xlarge)

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Failing merge rule: Core Maintainers

@wconstab (Contributor Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
Pull Request resolved: pytorch#138134
Approved by: https://github.com/H-Huang
@github-actions github-actions bot deleted the gh/wconstab/344/head branch November 30, 2024 02:09

Labels

ciflow/trunk - Trigger trunk jobs on your pull request
Merged
module: pipelining - Pipeline Parallelism
oncall: distributed - Add this issue/PR to distributed oncall triage queue
release notes: distributed (pipeline) - release notes category
