Skip to content

Conversation

@wconstab
Copy link
Contributor

@wconstab wconstab commented Oct 25, 2024

Stack from ghstack (oldest at bottom):

Used in both simulator and add_send_recv pass, the ready_to_schedule
logic works by looking at all the previously scheduled ops on a rank to
see if any of them 'unblocks' the current op to be scheduled. For example,
to schedule a FORWARD op, a previous RECV_F op is needed, unless this is
stage 0 or there is a previous stage on the same rank that ran FORWARD
already.

The old implementation iteratively compared the candidate op to the
previous ops. The new implementation uses set lookups to reduce
complexity. It also maintains the set of previous ops as ops are
scheduled rather than constructing a set on demand.

I did not save benchmark results, but this results in a 10-100x speedup
which is most noticeable for unit tests with artificially huge schedule
IR, the largest of which took longer than 20m before (I never let it
finish) but now takes less than 14s. Most schedules take less than
10ms.

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @d4l3k @c-p-i-o

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Oct 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138924

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6de4be9 with merge base failed to retrieve merge base, please contact dev infra:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]
@wconstab wconstab changed the title [Pipelining] Optimize ready_to_schedule logic [Pipelining] Optimize ready_to_schedule logic Oct 25, 2024
[ghstack-poisoned]
@wconstab wconstab added release notes: distributed (pipeline) release notes category module: pipelining Pipeline Parallelism labels Oct 30, 2024
@wconstab wconstab requested review from H-Huang and kwen2501 October 30, 2024 19:44
[ghstack-poisoned]
[ghstack-poisoned]
Copy link
Member

@H-Huang H-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10-100x speedup

Wow nice!!

rank = stage_to_rank(stage_idx)
ops = copy.deepcopy(schedule[rank])
return ops
_prev_ops_rank: Dict[int, Set[_Action]] = {rank: set() for rank in _schedule}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah smart idea to use set because each action is unique / hashable.

[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
@wconstab
Copy link
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 31, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@wconstab
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command
For more information see pytorch-bot wiki.

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
Used in both simulator and add_send_recv pass, the ready_to_schedule
logic works by looking at all the previously scheduled ops on a rank to
see if any of them 'unblocks' the current op to be scheduled.  For example,
to schedule a FORWARD op, a previous RECV_F op is needed, unless this is
stage 0 or there is a previous stage on the same rank that ran FORWARD
already.

The old implementation iteratively compared the candidate op to the
previous ops.  The new implementation uses set lookups to reduce
complexity.  It also maintains the set of previous ops as ops are
scheduled rather than constructing a set on demand.

I did not save benchmark results, but this results in a 10-100x speedup
which is most noticeable for unit tests with artificially huge schedule
IR, the largest of which took longer than 20m before (I never let it
finish) but now takes less than 14s.  Most schedules take less than
10ms.

Pull Request resolved: pytorch#138924
Approved by: https://github.com/H-Huang
ghstack dependencies: pytorch#138928, pytorch#131762
@github-actions github-actions bot deleted the gh/wconstab/352/head branch December 1, 2024 02:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged module: pipelining Pipeline Parallelism oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (pipeline) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants