[c10d] Make allreduce as a custom op #79582
alanwaketan wants to merge 3 commits into gh/alanwaketan/34/base
Conversation
Summary: This patch makes allreduce a custom op so that it is dispatcher passable. It is one part of the effort to route comm ops through the dispatcher so that tracing mechanisms that rely on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd.

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_allreduce_ops
python test/distributed/test_c10d_gloo.py -k test_allreduce_basics
...and other existing distributed tests.

[ghstack-poisoned]
❌ 1 new failure as of commit 771024f (more details on the Dr. CI page). The failure was recognized by patterns and does not appear to be due to upstream breakages.
```cpp
auto allreduce_fut =
    ops::allreduce(
        c10::intrusive_ptr<ProcessGroup>::unsafe_reclaim_from_nonowning(
            state_),
```
I'm trying to convert a raw pointer to an intrusive_ptr. Is this the way to do so?
Curious: who owns this PG instance? I assume it is owned by the Python PG object? If that's the case, will this mess up the refcount? What happens when this temporary intrusive_ptr exits scope?
I think PGs are normally owned by a Python object. AllReduceCommHook somehow holds a ProcessGroup* instead of an intrusive_ptr, so I need to convert the raw pointer to an intrusive_ptr.
I don't believe this will mess up the refcount. However, I actually think it's better to replace class AllReduceCommHook : public CppCommHookInterface<ProcessGroup*> with class AllReduceCommHook : public CppCommHookInterface<intrusive_ptr<ProcessGroup>>. What do you think?
> I don't believe this will mess up the refcnt.

If we create an intrusive_ptr from the raw pointer, does this mean we have two separate entities tracking the refcount for the same raw pointer? One is the Python object, and the other is this intrusive_ptr?

> I actually think it's better to replace class AllReduceCommHook : public CppCommHookInterface<ProcessGroup*> with class AllReduceCommHook : public CppCommHookInterface<intrusive_ptr<ProcessGroup>>. What do you think?

Yep, this does sound better to me.
I think for intrusive_ptr the refcount is stored in the object (ProcessGroup) itself; intrusive_ptr is just a way to increment/decrement that refcount. So it shouldn't matter.
Let me make a follow-up patch changing class AllReduceCommHook : public CppCommHookInterface<ProcessGroup*> to class AllReduceCommHook : public CppCommHookInterface<intrusive_ptr<ProcessGroup>>.
Got it. Can we also add a comment for this in the code? Thank you!
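To illustrate the refcount point above, here is a minimal Python sketch of intrusive refcounting (hypothetical names; this is not PyTorch's actual c10::intrusive_ptr implementation): because the counter is embedded in the pointee itself, a handle reclaimed from a raw pointer shares the same counter as every other handle, so there is no second, conflicting bookkeeping entity.

```python
class IntrusiveTarget:
    """Object that embeds its own refcount, analogous to c10::intrusive_ptr_target."""
    def __init__(self):
        self.refcount = 0

class IntrusivePtr:
    """Minimal handle that bumps the embedded refcount on acquire/release."""
    def __init__(self, target):
        self.target = target
        target.refcount += 1

    def release(self):
        self.target.refcount -= 1

pg = IntrusiveTarget()
owner = IntrusivePtr(pg)   # e.g. the Python-side owner
assert pg.refcount == 1

# "Reclaiming" a second handle from the raw object increments the SAME
# embedded counter.
temp = IntrusivePtr(pg)
assert pg.refcount == 2

temp.release()             # the temporary handle exits scope
assert pg.refcount == 1    # the owner still keeps the object alive
```

This is why the temporary intrusive_ptr in the hook does not corrupt the count: only one counter exists, inside the ProcessGroup itself.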
wanchaol left a comment:
> I guess I can add tests to have a python tensor that override torch_dispatch to directly verify that.

I see, I guess we can add those tests later in a separate PR when we actually need them. There are two things off the top of my head that need some input from @mrshenli; since these might affect the actual nodes that appear in the IR, we should get some clarity and make them consistent:
- About operator suffix and argument ordering: should we make the ATen operator follow our Python-level API, or should we follow the ATen operator naming convention?
- Should we let wait() appear in the IR? This might be related to how the CUDA stream sync works in our current tracer.
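For the first question, the ATen naming convention under discussion is roughly `namespace::name.overload(args) -> returns`. A small hypothetical Python sketch of how a tracer might split such a schema string (the schema examples are illustrative, not the ones this PR registers):

```python
def parse_schema_name(schema: str):
    """Split 'ns::name.overload(...)' into (namespace, name, overload)."""
    qualified = schema.split("(", 1)[0]        # drop the argument list
    ns, rest = qualified.split("::", 1)        # namespace prefix
    name, _, overload = rest.partition(".")    # optional overload suffix
    return ns, name, overload

# An ATen-style overloaded op carries an overload suffix:
assert parse_schema_name(
    "aten::add.Tensor(Tensor self, Tensor other) -> Tensor"
) == ("aten", "add", "Tensor")

# A comm op in the c10d namespace with no overload suffix:
assert parse_schema_name(
    "c10d::allreduce(Tensor[] tensors) -> Tensor[]"
) == ("c10d", "allreduce", "")
```

Following this convention uniformly is what lets a generic tracer interpret the comm ops the same way as every other ATen op.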
```cpp
        root_rank, root_tensor, std::chrono::milliseconds(timeout)});
}

c10::intrusive_ptr<ProcessGroup::Work> allreduce_(
```
Got it, thanks! One thing that caught my eye: this TorchBind Work object does not have methods like wait() bound. I guess this is fine initially, as this PR is more about making it a dispatcher-level op.
But I am wondering what this would look like in our traced IR. Should we have wait() in the graph? How does the traced graph look if we need async execution on a different CUDA stream, where the user usually needs to manually wait for the stream? cc @mrshenli
I think we should follow the ATen convention for the function schema, as that will be easier for any tracer to interpret the ops. At least AOTAutograd would assume the ATen convention.
Please see my other comments for the short-term solution. Long term, yes, we need a way to represent CUDA streams in the graph. We don't know how yet.
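As a hedged sketch of the wait()-in-the-graph question (all names here are hypothetical stand-ins, not the actual c10d bindings): if a tracer records both the collective launch and the wait() call as separate nodes, the async boundary stays visible in the IR, which is what a stream-aware backend would need.

```python
trace = []  # the "IR": an ordered list of recorded op names

class Work:
    """Stand-in for the Work handle a collective returns."""
    def __init__(self, op_name):
        self.op_name = op_name

    def wait(self):
        # wait() shows up as its own node, marking the sync point
        trace.append(f"{self.op_name}.wait")

def allreduce(tensors):
    """Stand-in for a traced dispatcher-level collective."""
    trace.append("c10d::allreduce")
    return Work("c10d::allreduce")

work = allreduce([1.0, 2.0])
# ...other traced ops could overlap with the in-flight collective here...
work.wait()

assert trace == ["c10d::allreduce", "c10d::allreduce.wait"]
```

Everything between the two nodes is, by construction, overlappable with the collective, so the graph itself encodes where synchronization is required.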
wanchaol left a comment:
Looks good to me, but it looks like the CI failure is real:

```
Broken ops: [
    c10d::broadcast(__torch__.torch.classes.c10d.ProcessGroup _0, Tensor[] _1, int _2, int _3, int _4) -> __torch__.torch.classes.c10d.Work _0
]
```

Could you fix the CI issue before landing? Thanks!
Thanks, Wanchao. I believe it's intended to break the schema. Do you know how to update the test expectation of the backward_compat test?
```python
("aten::segment_reduce", datetime.date(2022, 6, 30)),
("aten::_segment_reduce_backward", datetime.date(2022, 6, 30)),
("aten::empty.SymInt", datetime.date(9999, 1, 1)),
("c10d::broadcast", datetime.date(2022, 6, 25)),
```
Yeah, this is the correct way :) Although I am not sure which namespace broadcast got bound to; it looks like it's c10d, so we can see if this gets CI passing.
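For context, here is a hypothetical sketch of the expiry-date allow-list mechanism behind entries like the ones above (illustrative helper, not the actual implementation of the backward-compat test): a deliberately broken schema is tolerated only while its allow-list entry is unexpired.

```python
import datetime

# Entries mirror the allow list above: (op name, expiry date).
ALLOW_LIST = [
    ("c10d::broadcast", datetime.date(2022, 6, 25)),
]

def is_allowed(op_name, today):
    """A broken schema is permitted while its allow-list entry is unexpired."""
    return any(
        name == op_name and today < expiry
        for name, expiry in ALLOW_LIST
    )

# Before the expiry date, the intentional break is accepted:
assert is_allowed("c10d::broadcast", datetime.date(2022, 6, 20))
# After the expiry date, the same break fails the test again:
assert not is_allowed("c10d::broadcast", datetime.date(2022, 7, 1))
```

This is why an intentional schema change like this PR's is unblocked by adding a short-lived dated entry rather than by editing the test itself.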
The XLA failure doesn't seem to be related.

@pytorchbot merge -f

@pytorchbot successfully started a merge job. Check the current status here.

Hey @alanwaketan.
Summary: This patch makes allreduce a custom op so that it is dispatcher passable. It is one part of the effort to route comm ops through the dispatcher so that tracing mechanisms that rely on the dispatcher can trace them, e.g., LazyTensor and AOTAutograd.

Pull Request resolved: #79582
Approved by: https://github.com/wanchaol
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/e5841bafbd2868eaf6eb7b89b4caf3a6261dcfa6
Test plan from GitHub:
python test/distributed/test_c10d_nccl.py -k test_allreduce_ops
python test/distributed/test_c10d_gloo.py -k test_allreduce_basics
...and other existing distributed tests.
Reviewed By: atalman
Differential Revision: D37382098
Pulled By: alanwaketan
fbshipit-source-id: 068fd6d8f2c3fa3998431dcf878e14bd41890693
Can this be represented as edges in the graph?
I think we need more discussion on this. Let me try to organize a follow-up meeting.
Signed-off-by: Masaki Kozuki <[email protected]>
Co-authored-by: ptrblck <[email protected]>
Co-authored-by: Michael Carilli <[email protected]>

Patch for pytorch#79582. Apparently #79582 is newer than #34 and the commit below, so the PR assumes `ReduceOp` to be an `enum`, not a `struct` including an `enum` inside it.
Stack from ghstack (oldest at bottom):
Summary:
This patch makes allreduce a custom op so that it is dispatcher
passable. It is one part of the effort to route comm ops through the
dispatcher so that tracing mechanisms that rely on the dispatcher can
trace them, e.g., LazyTensor and AOTAutograd.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_allreduce_ops
python test/distributed/test_c10d_gloo.py -k test_allreduce_basics
...and other existing distributed tests.