[c10d] Make alltoall as a custom op #79691
alanwaketan wants to merge 4 commits into gh/alanwaketan/40/base
Conversation
Summary: This patch makes alltoall a custom op so that it's dispatcher passable. It's one part of the effort to route comm ops through the dispatcher so that tracing mechanisms that rely on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Test Plan: Existing distributed tests. [ghstack-poisoned]
✅ No failures (0 pending) as of commit 1d79a54 (more details on the Dr. CI page).
Summary:
This patch makes alltoall a custom op so that it's dispatcher
passable. It's one part of the effort to route comm ops through the dispatcher
so that tracing mechanisms that rely on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.
Test Plan:
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_cuda
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_cuda_complex
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_full_group_cuda
and other existing distributed tests.
[ghstack-poisoned]
}

c10::intrusive_ptr<ProcessGroup::Work> alltoall_(
    at::TensorList output_tensors,
Why is the type here different from the other comm ops?
Not stamping yet due to this comment.
I think it's a convention to use TensorList to represent const std::vector<at::Tensor>&.
I only use const std::vector<at::Tensor>& if there is a const std::vector<std::vector<at::Tensor>>& in the same signature, to keep that function consistent. Let me know which way you prefer.
I see. Can we add some docs to explain that in the code? Thanks!
torch/csrc/distributed/c10d/Ops.hpp (outdated)

const std::vector<std::vector<at::Tensor>>& input_tensors,
const ScatterOptions& opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work> alltoall(const c10::intrusive_ptr<ProcessGroup>& process_group,
    at::TensorList output_tensors,
Shall we run clang-format on this file? It might result in a different format.
The pull request passes the linter, and the format seems consistent with other function signatures in the same file. What's your concern here?
We used to run clang-format on all distributed cpp files. Not sure what's today's convention. Below is what clang-format gives me:
namespace c10d {
namespace ops {
// Below are essentially ProcessGroup's corresponding ops but routed to the
// dispatcher.
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
broadcast(const c10::intrusive_ptr<ProcessGroup> &process_group,
at::TensorList tensors, const BroadcastOptions &opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
allreduce(const c10::intrusive_ptr<ProcessGroup> &process_group,
at::TensorList tensors, const AllreduceOptions &opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
allgather(const c10::intrusive_ptr<ProcessGroup> &process_group,
const std::vector<std::vector<at::Tensor>> &output_tensors,
const std::vector<at::Tensor> &input_tensors,
const AllgatherOptions &opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
reduce_scatter(const c10::intrusive_ptr<ProcessGroup> &process_group,
const std::vector<at::Tensor> &output_tensors,
const std::vector<std::vector<at::Tensor>> &input_tensors,
const ReduceScatterOptions &opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
reduce(const c10::intrusive_ptr<ProcessGroup> &process_group,
at::TensorList tensors, const ReduceOptions &opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
gather(const c10::intrusive_ptr<ProcessGroup> &process_group,
const std::vector<std::vector<at::Tensor>> &output_tensors,
const std::vector<at::Tensor> &input_tensors,
const GatherOptions &opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
scatter(const c10::intrusive_ptr<ProcessGroup> &process_group,
const std::vector<at::Tensor> &output_tensors,
const std::vector<std::vector<at::Tensor>> &input_tensors,
const ScatterOptions &opts = {});
TORCH_API c10::intrusive_ptr<ProcessGroup::Work>
alltoall(const c10::intrusive_ptr<ProcessGroup> &process_group,
at::TensorList output_tensors, at::TensorList input_tensors,
const AllToAllOptions &opts = {});
} // namespace ops
} // namespace c10d
Somehow, my local linter doesn't produce the same output as yours. Let me just copy yours.
.def(
    "alltoall",
-   [](::c10d::ProcessGroup& pg,
+   [](const c10::intrusive_ptr<::c10d::ProcessGroup>& pg,
mrshenli left a comment:
LGTM. Left a minor comment.
@pytorchbot merge --green

Thanks for approving this pull request, Shen and Wanchao.

@pytorchbot successfully started a merge job. Check the current status here.
Summary: This patch makes alltoall a custom op so that it's dispatcher passable. It's one part of the effort to route comm ops through the dispatcher so that tracing mechanisms that rely on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd. Pull Request resolved: #79691 Approved by: https://github.com/mrshenli, https://github.com/wanchaol Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/dcd17357a4aa7ed175dc6cfd46f502499dd4bdb0 Test plan from GitHub: BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_cuda BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_cuda_complex BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_full_group_cuda and other existing distributed tests. Reviewed By: atalman Differential Revision: D37455827 Pulled By: alanwaketan fbshipit-source-id: 6745fd7d81a89b47e291da786e4511a6ce76be12
Stack from ghstack (oldest at bottom):