Skip to content

[c10d] Make allgather as a custom op#79669

Closed
alanwaketan wants to merge 3 commits intogh/alanwaketan/35/basefrom
gh/alanwaketan/35/head
Closed

[c10d] Make allgather as a custom op#79669
alanwaketan wants to merge 3 commits intogh/alanwaketan/35/basefrom
gh/alanwaketan/35/head

Conversation

@alanwaketan
Copy link
Copy Markdown
Collaborator

@alanwaketan alanwaketan commented Jun 16, 2022

Stack from ghstack (oldest at bottom):

Summary:
This patch makes allgather as a custom op such that it's dispatcher
passable. It's one part of the effort to route comm ops to the dispatcher
such that tracing mechanisms that relies on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_allgather_ops
python test/distributed/test_c10d_gloo.py -k test_allgather_basics
...and other existing distributed tests.

Summary:
This patch makes allgather as a custom op such that it's dispatcher
passable. It's one part of the effort to route comm ops to the dispatcher
such that tracing mechanisms that relies on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_allgather_ops
python test/distributed/test_c10d_gloo.py -k test_allgather_basics
...and other existing distributed tests.

[ghstack-poisoned]
@facebook-github-bot
Copy link
Copy Markdown
Contributor

facebook-github-bot commented Jun 16, 2022

🔗 Helpful links

❌ 1 New Failures

As of commit 075cca0 (more details on the Dr. CI page):

Expand to see more
  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge) (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-06-22T22:58:19.4167431Z ##[error]Process completed with exit code 1.
2022-06-22T22:58:19.1795558Z �[1A�[K�[32mINFO: �[0mElapsed time: 164.408s
2022-06-22T22:58:19.1796044Z �[32mLoading:�[0m 0 packages loaded
2022-06-22T22:58:19.1849924Z 
2022-06-22T22:58:19.1850585Z �[1A�[K�[32mINFO: �[0m0 processes.
2022-06-22T22:58:19.1851064Z �[32mLoading:�[0m 0 packages loaded
2022-06-22T22:58:19.1851205Z 
2022-06-22T22:58:19.1851461Z �[1A�[K�[31m�[1mFAILED:�[0m Build did NOT complete successfully (0 packages loaded)
2022-06-22T22:58:19.1851647Z 
2022-06-22T22:58:19.1851857Z �[1A�[K�[31m�[1mFAILED:�[0m Build did NOT complete successfully (0 packages loaded)
2022-06-22T22:58:19.1942873Z �[0mFailed to build external libraries: ['/var/lib/jenkins/workspace/xla/build_torch_xla_libs.sh', '-O', '-D_GLIBCXX_USE_CXX11_ABI=1', 'install']
2022-06-22T22:58:19.4167431Z ##[error]Process completed with exit code 1.
2022-06-22T22:58:19.4211607Z Prepare all required actions
2022-06-22T22:58:19.4211957Z Getting action download info
2022-06-22T22:58:19.6084529Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-06-22T22:58:19.6084756Z with:
2022-06-22T22:58:19.6085182Z   github-token: ***
2022-06-22T22:58:19.6085338Z env:
2022-06-22T22:58:19.6085511Z   GIT_DEFAULT_BRANCH: master
2022-06-22T22:58:19.6085694Z ##[endgroup]
2022-06-22T22:58:19.6111722Z ##[group]Run nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a
2022-06-22T22:58:19.6111958Z with:

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jun 16, 2022
@alanwaketan alanwaketan requested review from aazzolini, ezyang and wconstab and removed request for H-Huang, awgu, mingzhe09088, mrshenli, rohan-varma and zhaojuanmao June 16, 2022 01:50
Summary:
This patch makes allgather as a custom op such that it's dispatcher
passable. It's one part of the effort to route comm ops to the dispatcher
such that tracing mechanisms that relies on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_allgather_ops
python test/distributed/test_c10d_gloo.py -k test_allgather_basics
...and other existing distributed tests.

[ghstack-poisoned]
alanwaketan pushed a commit that referenced this pull request Jun 16, 2022
Summary:
This patch makes allgather as a custom op such that it's dispatcher
passable. It's one part of the effort to route comm ops to the dispatcher
such that tracing mechanisms that relies on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_allgather_ops
python test/distributed/test_c10d_gloo.py -k test_allgather_basics
...and other existing distributed tests.

ghstack-source-id: 98f6ccb
Pull Request resolved: #79669
@alanwaketan alanwaketan requested a review from mrshenli June 16, 2022 20:56
Summary:
This patch makes allgather as a custom op such that it's dispatcher
passable. It's one part of the effort to route comm ops to the dispatcher
such that tracing mechanisms that relies on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_allgather_ops
python test/distributed/test_c10d_gloo.py -k test_allgather_basics
...and other existing distributed tests.

[ghstack-poisoned]
alanwaketan pushed a commit that referenced this pull request Jun 22, 2022
Summary:
This patch makes allgather as a custom op such that it's dispatcher
passable. It's one part of the effort to route comm ops to the dispatcher
such that tracing mechanisms that relies on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_allgather_ops
python test/distributed/test_c10d_gloo.py -k test_allgather_basics
...and other existing distributed tests.

ghstack-source-id: 8676b05
Pull Request resolved: #79669
@alanwaketan alanwaketan requested a review from wanchaol June 22, 2022 22:34
Copy link
Copy Markdown
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I'd vote for let's some tests to cover both forward and backward (i.e., torch.distributed.nn.*). It does not have to be in this PR, can be a PR on top. Just wanna verify we didn't miss anything. :)

const c10::intrusive_ptr<ProcessGroup>& process_group,
int64_t timeout) {
return process_group->allgather(
const_cast<std::vector<std::vector<at::Tensor>>&>(output_tensors),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to modify allgather to take const std::vector as well if necessary. Shall we create an issue to track that?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I agreed. PG's should accept const vectors& instead of vectors& given what we do is to update the content of the values within the vectors and not the vectors itself. Let me create an issue. #80241.

@alanwaketan
Copy link
Copy Markdown
Collaborator Author

Thanks Shen for approving this pull request. The CI failure doesn't seem to be related.

@alanwaketan
Copy link
Copy Markdown
Collaborator Author

@pytorchbot merge -f

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@github-actions
Copy link
Copy Markdown
Contributor

Hey @alanwaketan.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@alanwaketan
Copy link
Copy Markdown
Collaborator Author

LGTM! I'd vote for let's some tests to cover both forward and backward (i.e., torch.distributed.nn.*). It does not have to be in this PR, can be a PR on top. Just wanna verify we didn't miss anything. :)

Sounds good. Opened #80242 to keep track of this.

facebook-github-bot pushed a commit that referenced this pull request Jun 27, 2022
Summary:
This patch makes allgather as a custom op such that it's dispatcher
passable. It's one part of the effort to route comm ops to the dispatcher
such that tracing mechanisms that relies on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Pull Request resolved: #79669
Approved by: https://github.com/mrshenli

Test Plan:
contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/3359af93909bb6009659e01e81eabb9c8dbb4b3a

Test plan from GitHub:
python test/distributed/test_c10d_nccl.py -k test_allgather_ops
python test/distributed/test_c10d_gloo.py -k test_allgather_basics
...and other existing distributed tests.

Reviewed By: atalman

Differential Revision: D37455652

Pulled By: alanwaketan

fbshipit-source-id: 058691c6f5a59807b585a22a93353e91313f8b6f
@facebook-github-bot facebook-github-bot deleted the gh/alanwaketan/35/head branch June 28, 2022 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category topic: new features topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants