[c10d] Make send/recv as custom ops#79779
[c10d] Make send/recv as custom ops#79779alanwaketan wants to merge 5 commits intogh/alanwaketan/42/basefrom
Conversation
Summary: This patch makes send/recv as custom ops such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: python test/distributed/test_c10d_nccl.py -k test_send_recv ...and other existing distributed tests. [ghstack-poisoned]
🔗 Helpful links
✅ No Failures (0 Pending)As of commit f39860d (more details on the Dr. CI page): Expand to see more💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Please report bugs/suggestions to the (internal) Dr. CI Users group. |
Summary: This patch makes send/recv as custom ops such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: python test/distributed/test_c10d_nccl.py -k test_send_recv ...and other existing distributed tests. ghstack-source-id: d0536fb Pull Request resolved: #79779
Summary: This patch makes send/recv as custom ops such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: python test/distributed/test_c10d_nccl.py -k test_send_recv ...and other existing distributed tests. [ghstack-poisoned]
Summary: This patch makes send/recv as custom ops such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: python test/distributed/test_c10d_nccl.py -k test_send_recv ...and other existing distributed tests. ghstack-source-id: c627df1 Pull Request resolved: #79779
Summary: This patch makes send/recv as custom ops such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: python test/distributed/test_c10d_nccl.py -k test_send_recv ...and other existing distributed tests. [ghstack-poisoned]
Summary: This patch makes send/recv as custom ops such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: python test/distributed/test_c10d_nccl.py -k test_send_recv ...and other existing distributed tests. ghstack-source-id: b5cad84 Pull Request resolved: #79779
| } | ||
|
|
||
| c10::intrusive_ptr<ProcessGroup::Work> send( | ||
| at::TensorList tensors, |
There was a problem hiding this comment.
same comment on TensorList vs std::vector<Tensor>
There was a problem hiding this comment.
I think it's a convention to use TensorList to represent const std::vectorat::Tensor&.
I only use const std::vectorat::Tensor& if there is a const std::vectorstd::vectorat::Tensor& in the signature to keep consistency on that function. Let me know which way you like.
| BarrierOptions{device_ids, std::chrono::milliseconds(timeout)}); | ||
| } | ||
|
|
||
| c10::intrusive_ptr<ProcessGroup::Work> send( |
There was a problem hiding this comment.
Send is not a in-place op. So there is no needs to have a trailing _.
wanchaol
left a comment
There was a problem hiding this comment.
lgtm, have some questions.
| int64_t dstRank, | ||
| int64_t tag) { | ||
| static auto op = c10::Dispatcher::singleton() | ||
| .findSchemaOrThrow("c10d::send", "") |
There was a problem hiding this comment.
hmm would we not find the schema by any chance?
There was a problem hiding this comment.
Why would you have such concern? Do you happen to know another op called "c10d::send" as well?
| return op.call(tensors, process_group, dstRank, tag); | ||
| } | ||
|
|
||
| c10::intrusive_ptr<ProcessGroup::Work> recv( |
There was a problem hiding this comment.
just noticed we have this recv and recv_ two separate APIs (above), what's the difference between these two and should we merge?
There was a problem hiding this comment.
The above recv_ is the implementation of the function schema. This one is the C++ API for users to call the op. recv_ follows aten ops's conventions while recv follows c10d's conventions.
Summary: This patch makes send/recv as custom ops such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: python test/distributed/test_c10d_nccl.py -k test_send_recv ...and other existing distributed tests. [ghstack-poisoned]
Summary: This patch makes send/recv as custom ops such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: python test/distributed/test_c10d_nccl.py -k test_send_recv ...and other existing distributed tests. ghstack-source-id: 07f0d10 Pull Request resolved: #79779
|
@pytorchbot merge --green |
|
Thanks for approving this pull request, Shen and Wanchao. |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Summary: This patch makes send/recv as custom ops such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: python test/distributed/test_c10d_nccl.py -k test_send_recv ...and other existing distributed tests. [ghstack-poisoned]
Summary: This patch makes send/recv as custom ops such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: python test/distributed/test_c10d_nccl.py -k test_send_recv ...and other existing distributed tests. ghstack-source-id: b1c39c2 Pull Request resolved: #79779
|
@pytorchbot merge --green |
|
@pytorchbot successfully started a merge job. Check the current status here |
|
Hey @alanwaketan. |
Summary: This patch makes send/recv as custom ops such that it's dispatcher passable. It's one part of the effort to route comm ops to the dispatcher such that tracing mechanisms that relies on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Pull Request resolved: #79779 Approved by: https://github.com/mrshenli, https://github.com/wanchaol Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/79c2dfcd8ef3ab2ca14a942109242ae7440dd6b1 Test plan from GitHub: python test/distributed/test_c10d_nccl.py -k test_send_recv ...and other existing distributed tests. Reviewed By: b0noI Differential Revision: D37509682 Pulled By: alanwaketan fbshipit-source-id: a3a698f85b64b256618b8af95dd29aade4b50d74
Stack from ghstack (oldest at bottom):
Summary:
This patch makes send/recv as custom ops such that it's dispatcher
passable. It's one part of the effort to route comm ops to the dispatcher
such that tracing mechanisms that relies on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_send_recv
...and other existing distributed tests.