[c10d] Make allgather as a custom op#79669
[c10d] Make allgather as a custom op#79669alanwaketan wants to merge 3 commits intogh/alanwaketan/35/basefrom
Conversation
Summary: This patch makes allgather as a custom op such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: python test/distributed/test_c10d_nccl.py -k test_allgather_ops python test/distributed/test_c10d_gloo.py -k test_allgather_basics ...and other existing distributed tests. [ghstack-poisoned]
🔗 Helpful links
❌ 1 New FailuresAs of commit 075cca0 (more details on the Dr. CI page): Expand to see more
🕵️ 1 new failure recognized by patternsThe following CI failures do not appear to be due to upstream breakages
|
Summary: This patch makes allgather as a custom op such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: python test/distributed/test_c10d_nccl.py -k test_allgather_ops python test/distributed/test_c10d_gloo.py -k test_allgather_basics ...and other existing distributed tests. [ghstack-poisoned]
Summary: This patch makes allgather as a custom op such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: python test/distributed/test_c10d_nccl.py -k test_allgather_ops python test/distributed/test_c10d_gloo.py -k test_allgather_basics ...and other existing distributed tests. ghstack-source-id: 98f6ccb Pull Request resolved: #79669
Summary: This patch makes allgather as a custom op such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: python test/distributed/test_c10d_nccl.py -k test_allgather_ops python test/distributed/test_c10d_gloo.py -k test_allgather_basics ...and other existing distributed tests. [ghstack-poisoned]
Summary: This patch makes allgather as a custom op such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: python test/distributed/test_c10d_nccl.py -k test_allgather_ops python test/distributed/test_c10d_gloo.py -k test_allgather_basics ...and other existing distributed tests. ghstack-source-id: 8676b05 Pull Request resolved: #79669
mrshenli
left a comment
There was a problem hiding this comment.
LGTM! I'd vote for let's some tests to cover both forward and backward (i.e., torch.distributed.nn.*). It does not have to be in this PR, can be a PR on top. Just wanna verify we didn't miss anything. :)
| const c10::intrusive_ptr<ProcessGroup>& process_group, | ||
| int64_t timeout) { | ||
| return process_group->allgather( | ||
| const_cast<std::vector<std::vector<at::Tensor>>&>(output_tensors), |
There was a problem hiding this comment.
We should be able to modify allgather to take const std::vector as well if necessary. Shall we create an issue to track that?
There was a problem hiding this comment.
Yea, I agreed. PG's should accept const vectors& instead of vectors& given what we do is to update the content of the values within the vectors and not the vectors itself. Let me create an issue. #80241.
|
Thanks Shen for approving this pull request. The CI failure doesn't seem to be related. |
|
@pytorchbot merge -f |
|
@pytorchbot successfully started a merge job. Check the current status here |
|
Hey @alanwaketan. |
Sounds good. Opened #80242 to keep track of this. |
Summary: This patch makes allgather as a custom op such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Pull Request resolved: #79669 Approved by: https://github.com/mrshenli Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/3359af93909bb6009659e01e81eabb9c8dbb4b3a Test plan from GitHub: python test/distributed/test_c10d_nccl.py -k test_allgather_ops python test/distributed/test_c10d_gloo.py -k test_allgather_basics ...and other existing distributed tests. Reviewed By: atalman Differential Revision: D37455652 Pulled By: alanwaketan fbshipit-source-id: 058691c6f5a59807b585a22a93353e91313f8b6f
Stack from ghstack (oldest at bottom):
Summary:
This patch makes allgather as a custom op such that it's dispatcher
passable. It's one part of the effort to route comm ops to the dispatcher
such that tracing mechanisms that relies on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_allgather_ops
python test/distributed/test_c10d_gloo.py -k test_allgather_basics
...and other existing distributed tests.