[c10d] Make alltoall as a custom op #79691
alanwaketan wants to merge 4 commits into gh/alanwaketan/40/base
Conversation
Summary: This patch makes alltoall a custom op so that it's dispatcher passable. It's one part of the effort to route comm ops through the dispatcher so that tracing mechanisms that rely on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: Existing distributed tests. [ghstack-poisoned]
✅ No failures (0 pending) as of commit 1d79a54 (more details on the Dr. CI page).
Summary:
This patch makes alltoall a custom op so that it's dispatcher
passable. It's one part of the effort to route comm ops through the dispatcher
so that tracing mechanisms that rely on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.
Test Plan:
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_cuda
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_cuda_complex
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_full_group_cuda
and other existing distributed tests.
[ghstack-poisoned]
}

c10::intrusive_ptr<ProcessGroup::Work> alltoall_(
    at::TensorList output_tensors,
Why is the type here different from the other comm ops?
Not stamping yet due to this comment.
I think it's a convention to use TensorList to represent const std::vector<at::Tensor>&.
I only use const std::vector<at::Tensor>& if there is a const std::vector<std::vector<at::Tensor>>& in the same signature, to keep that function consistent. Let me know which way you prefer.
I see. Can we add some docs to explain that in the code? Thanks!
torch/csrc/distributed/c10d/Ops.hpp (outdated)

const std::vector<std::vector<at::Tensor>>& input_tensors,
const ScatterOptions& opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work> alltoall(const c10::intrusive_ptr<ProcessGroup>& process_group,
    at::TensorList output_tensors,
Shall we run clang-format on this file? It might result in a different format.
The pull request passes the linter, and the format seems consistent with other function signatures in the same file. What's your concern here?
We used to run clang-format on all distributed cpp files. Not sure what's today's convention. Below is what clang-format gives me:
namespace c10d {
namespace ops {
// Below are essentially ProcessGroup's corresponding ops but routed to the
// dispatcher.
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
broadcast(const c10::intrusive_ptr<ProcessGroup> &process_group,
at::TensorList tensors, const BroadcastOptions &opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
allreduce(const c10::intrusive_ptr<ProcessGroup> &process_group,
at::TensorList tensors, const AllreduceOptions &opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
allgather(const c10::intrusive_ptr<ProcessGroup> &process_group,
const std::vector<std::vector<at::Tensor>> &output_tensors,
const std::vector<at::Tensor> &input_tensors,
const AllgatherOptions &opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
reduce_scatter(const c10::intrusive_ptr<ProcessGroup> &process_group,
const std::vector<at::Tensor> &output_tensors,
const std::vector<std::vector<at::Tensor>> &input_tensors,
const ReduceScatterOptions &opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
reduce(const c10::intrusive_ptr<ProcessGroup> &process_group,
at::TensorList tensors, const ReduceOptions &opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
gather(const c10::intrusive_ptr<ProcessGroup> &process_group,
const std::vector<std::vector<at::Tensor>> &output_tensors,
const std::vector<at::Tensor> &input_tensors,
const GatherOptions &opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
scatter(const c10::intrusive_ptr<ProcessGroup> &process_group,
const std::vector<at::Tensor> &output_tensors,
const std::vector<std::vector<at::Tensor>> &input_tensors,
const ScatterOptions &opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
alltoall(const c10::intrusive_ptr<ProcessGroup> &process_group,
at::TensorList output_tensors, at::TensorList input_tensors,
const AllToAllOptions &opts = {});
} // namespace ops
} // namespace c10d
Somehow, my local linter doesn't produce the same output as yours. Let me just copy yours.
.def(
    "alltoall",
-   [](::c10d::ProcessGroup& pg,
+   [](const c10::intrusive_ptr<::c10d::ProcessGroup>& pg,
mrshenli left a comment:
LGTM. Left a minor comment.
@pytorchbot merge --green

Thanks for approving this pull request, Shen and Wanchao.

@pytorchbot successfully started a merge job. Check the current status here.
Summary: This patch makes alltoall a custom op so that it's dispatcher passable. It's one part of the effort to route comm ops through the dispatcher so that tracing mechanisms that rely on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Pull Request resolved: #79691 Approved by: https://github.com/mrshenli, https://github.com/wanchaol Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/dcd17357a4aa7ed175dc6cfd46f502499dd4bdb0 Test plan from GitHub: BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_cuda BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_cuda_complex BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_full_group_cuda and other existing distributed tests. Reviewed By: atalman Differential Revision: D37455827 Pulled By: alanwaketan fbshipit-source-id: 6745fd7d81a89b47e291da786e4511a6ce76be12
Stack from ghstack (oldest at bottom):