Conversation

Contributor

@rohan-varma rohan-varma commented Nov 22, 2019

Stack from ghstack:

Differential Revision: D18661775

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.
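The "not called multiple times" guarantee amounts to an idempotent shutdown path. A minimal sketch of that guard, in Python for brevity (the `Agent`/`local_shutdown` names are illustrative analogues of the C++ `localShutdown`, not the actual implementation):

```python
import threading

class Agent:
    """Toy analogue of an RPC agent whose destructor and explicit
    shutdown() both funnel into one idempotent local_shutdown()."""

    def __init__(self):
        self._shutdown_lock = threading.Lock()
        self._shut_down = False
        self.shutdown_count = 0  # instrumentation for this example

    def local_shutdown(self):
        # Run the teardown at most once, even when reached from both
        # shutdown() and __del__ (mirroring the destructor path).
        with self._shutdown_lock:
            if self._shut_down:
                return
            self._shut_down = True
        self.shutdown_count += 1  # abort listener, join threads, etc.

    def shutdown(self):
        self.local_shutdown()

    def __del__(self):
        self.local_shutdown()

agent = Agent()
agent.shutdown()
agent.local_shutdown()  # second call is a no-op
print(agent.shutdown_count)  # prints 1
```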

ghstack-source-id: 94442415

Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)

[ghstack-poisoned]
@zhaojuanmao
Contributor

@rohan-varma and @mrshenli I'm wondering why we need to split this into wait_all_workers() and shutdown() if shutdown() always needs to follow wait_all_workers()? It looks like they are not independent.

@mrshenli
Contributor

@zhaojuanmao

One reason is that wait_all_workers could become a non-terminating API in the future, something similar to dist.barrier(). Applications could then do some rpc/remote work, call wait_all_workers for a global sync, resume, perhaps do another global sync, and finally shutdown (as mentioned by @satgera). We don't explicitly say we support this yet, but we can discuss if that's what we want.
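The flow described above (work, global sync, more work, final sync, shutdown) can be sketched with plain threads, using `threading.Barrier` as a stand-in for a barrier-style wait_all_workers(); this is an illustrative analogue, not the RPC API itself:

```python
import threading

world_size = 3
sync = threading.Barrier(world_size)   # stand-in for wait_all_workers()
results = []
results_lock = threading.Lock()

def worker(rank):
    # Round 1 of rpc/remote work.
    with results_lock:
        results.append((1, rank))
    sync.wait()   # global sync: every peer finished round 1
    # Round 2, safe to rely on round-1 side effects everywhere.
    with results_lock:
        results.append((2, rank))
    sync.wait()   # final sync before a hypothetical shutdown()

threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The barrier guarantees all round-1 entries precede all round-2 entries.
rounds = [r for r, _ in results]
assert rounds == sorted(rounds)
print("ok")
```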

Resubmit of #30020, which was reverted.
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.


Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)

[ghstack-poisoned]
recvWork_->abort();
}
}
threadPool_.waitWorkComplete();
Contributor Author

I had to add this here in shutdown; otherwise the test test_rpc_shutdown() would flake with gloo connection reset errors. I don't think it delays shutdown by much, since it just waits for already-enqueued work to complete.

Contributor

hmm, does this need to be placed before abort? Otherwise, if there are indeed unfinished tasks, the receiving end's listener thread might have already been aborted?

Contributor Author

Hm, my reasoning for doing it after was that if it were before, we could receive more work while aborting the listener thread, enqueue that recv, and then never wait for it to complete. For example, say that recvWork->wait has already been unblocked (via a message from another worker) before we call its abort; we may then have called enqueueRecv, which adds a task to the thread pool.
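The ordering argument can be played out with a toy listener/thread-pool pair: stop the producer first, then drain the queue, so a recv enqueued just before the abort is still waited on. A loose Python analogue (all names hypothetical):

```python
import queue
import threading

pool = queue.Queue()
done = []

def pool_worker():
    # Thread-pool stand-in: process tasks until the shutdown sentinel.
    while True:
        task = pool.get()
        if task is None:
            break
        done.append(task)

w = threading.Thread(target=pool_worker)
w.start()

# The "listener" enqueues one last recv right before it is aborted.
pool.put("late-recv")

# shutdown(): abort the listener first, so no new work can arrive...
pool.put(None)  # sentinel standing in for abort()
# ...then wait for already-enqueued work, mirroring waitWorkComplete().
w.join()
assert done == ["late-recv"]
print("drained")
```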

@zhaojuanmao
Contributor

lgtm!

r"""
Block until all local and remote RPC processes reach this method, and then
destroy the local RPC agent, RRef, and RPC handlers. Every RPC process must
call this method before
Contributor

"RPC handlers" -> "Python RPC handlers"

RPC handlers is vague.

Contributor Author

@rohan-varma rohan-varma Nov 23, 2019

This docstring is also in the wrong place now, will update. I actually don't think we need to mention that we destroy RPC handlers and RRef context, those are internal details that should be abstracted away from the user.

Contributor

> I actually don't think we need to mention that we destroy RPC handlers and RRef context, those are internal details that should be abstracted away from the user.

Agree, users don't need to know there is an RRef context and handler.

Resubmit of #30020, which was reverted.
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.


Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Nov 23, 2019
Pull Request resolved: #30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.

ghstack-source-id: 94468192

Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)
@rohan-varma rohan-varma changed the title [resubmit][rpc] Add local shutdown to process group agent [test all][resubmit][rpc] Add local shutdown to process group agent Nov 23, 2019
>>> rpc.wait_all_workers()
>>> rpc.shutdown()
"""
global _agent
Contributor

(This can come in a follow-up PR)

We might not need this global _agent any more, as we are not modifying it.


Contributor

@mrshenli mrshenli left a comment

Approve to unblock. Please wait for all tests to pass.

…oup agent"

[test all]
Resubmit of #30020, which was reverted.
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.


Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Nov 23, 2019
Pull Request resolved: #30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.

ghstack-source-id: 94480867

Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)
@rohan-varma rohan-varma changed the title [test all][resubmit][rpc] Add local shutdown to process group agent [resubmit][rpc] Add local shutdown to process group agent Nov 23, 2019
}
threadPool_.waitWorkComplete();
listenerThread_.join();
futureTimeoutCV_.notify_one();
Contributor

@xush6528 xush6528 Nov 24, 2019

What if futureTimeoutThread_ is not waiting on futureTimeoutCV_? For example, it could be at the line `std::chrono::milliseconds sleepTime;`.

Will futureTimeoutThread_ miss the notify_one() and wait on the condition variable forever?

I am afraid the shutdown test case will be flaky, and some failure instances will be timeouts.

I found a tutorial discussing how to use a predicate to avoid losing notifications: https://www.modernescpp.com/index.php/c-core-guidelines-be-aware-of-the-traps-of-condition-variables

The second example, "An atomic predicate", is especially interesting. It shows that even when we do `rpcRunning_.store(true)`, we should acquire the lock that is used by futureTimeoutCV_.

Contributor Author

@rohan-varma rohan-varma Nov 24, 2019

Thanks for pointing this out! I went ahead and made the changes to hold the lock as you suggested.
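The pattern being discussed (set the flag while holding the mutex that the condition variable uses, and wait with a predicate) can be sketched in Python; `rpc_running` here is a stand-in for `rpcRunning_`, and the lock for `futureMutex_`:

```python
import threading
import time

lock = threading.Lock()
cv = threading.Condition(lock)
rpc_running = True   # the predicate guarded by `lock`
woke = []

def future_timeout_thread():
    with cv:
        # wait_for() re-checks the predicate on every wakeup, so a
        # notification can never be lost between the check and the wait.
        cv.wait_for(lambda: not rpc_running)
    woke.append(True)

t = threading.Thread(target=future_timeout_thread)
t.start()
time.sleep(0.05)

# shutdown(): flip the flag *while holding the same lock* before notifying;
# otherwise the waiter could observe rpc_running == True, then block in
# wait() just after our notify fires, and sleep forever.
with cv:
    rpc_running = False
    cv.notify()

t.join(timeout=5)
assert woke == [True]
print("no lost wakeup")
```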

}

void ProcessGroupAgent::start() {
rpcRunning_.store(true);
Contributor

@xush6528 xush6528 Nov 24, 2019

I changed the futureTimeoutCV_.wait_for(...) part in #30355.

We need to change this line to

{
  std::lock_guard<std::mutex> futureLock{futureMutex_};
  rpcRunning_.store(true);
}

in #30355 as well.

Contributor

@mrshenli mrshenli left a comment

Remove stamp until we address @xush6528's comment

[test all]
Resubmit of #30020, which was reverted.
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.


Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Nov 24, 2019
Pull Request resolved: #30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.

ghstack-source-id: 94490200

Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)
rohan-varma added a commit that referenced this pull request Nov 27, 2019
Pull Request resolved: #30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.

ghstack-source-id: 94642022

Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)
# Run world_size workers.
world_size = 2
for i in range(world_size):
p = mp.Process(target=run_process, args=(i, (i + 1) % 2, world_size))
Contributor

mp.spawn for the win!

Contributor Author

Addressing in separate PR (#30381)


def _check_rpc_done(rank_distance):
while not rpc_done[rank_distance]:
time.sleep(0)
Contributor

Could you add a comment here saying that this yields control to other threads, so that it doesn't get removed later?

Contributor Author

Will address in a follow up PR.

Contributor Author

I ended up addressing it in this PR, since I had to do a rebase anyways.
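A commented version of the helper, as suggested (the `rpc_done` table and the remote side here are toy stand-ins for the test-file globals):

```python
import threading
import time

# Toy stand-in for the module-level flag table in the test file.
rpc_done = {1: False}

def _check_rpc_done(rank_distance):
    while not rpc_done[rank_distance]:
        # time.sleep(0) yields control to other threads without adding
        # latency, so don't "simplify" this to an empty loop body.
        time.sleep(0)

def remote_side():
    # Simulates the peer marking its RPC as done.
    rpc_done[1] = True

t = threading.Thread(target=remote_side)
t.start()
_check_rpc_done(1)
t.join()
print("done")
```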

// ProcessGroupAgent::start and unset in ProcessGroupAgent::shutdown and
// ProcessGroupAgent::join. It controls whether several background threads
// should be running.
std::atomic<bool> rpcRunning_{false};
Contributor

Sounds good, but please document what the expected locking strategy is.

Using an atomic implies (at least to me) that you don't need locks. To reduce the probability of this causing breakage later on, I think a regular bool is better. Accessing that without locks is a red flag in code review.

@rohan-varma
Contributor Author

The following are the CI errors, all unrelated:

  1. Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:405-3796e2c20d4587ffdcae63b42490a63316be44bd-android-arm-v7a not found

  2. Nov 27 08:55:40 FAIL [0.032s]: test_neg_cuda (main.TestTorchDeviceTypeCUDA)
    Nov 27 08:55:40 ----------------------------------------------------------------------
    Nov 27 08:55:40 RuntimeError: CUDA error: misaligned address

  3. Nov 27 00:01:04 Traceback (most recent call last):
    Nov 27 00:01:04 File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/soundfile.py", line 142, in
    Nov 27 00:01:04 raise OSError('sndfile library not found')
    Nov 27 00:01:04 OSError: sndfile library not found

  4. 08:34:39 FAIL: test_logical_or_cuda (main.TestTorchDeviceTypeCUDA)
    08:34:39 ----------------------------------------------------------------------
    08:34:39 Traceback (most recent call last):
    08:34:39 File "/var/lib/jenkins/workspace/test/common_utils.py", line 631, in wrapper
    08:34:39 method(*args, **kwargs)
    08:34:39 File "/var/lib/jenkins/workspace/test/common_device_type.py", line 179, in instantiated_test
    08:34:39 return test(self, device_arg)
    08:34:39 File "test_torch.py", line 6578, in test_logical_or
    08:34:39 self._test_logical(device, 'logical_or', [10, 0, 1, 0], [1, 0, 0, 10], [1, 0, 1, 1])
    08:34:39 File "test_torch.py", line 6555, in _test_logical
    08:34:39 self.assertEqual(expected_res.bool(), getattr(a, op)(b))
    08:34:39 File "/var/lib/jenkins/workspace/test/common_utils.py", line 808, in assertEqual
    08:34:39 assertTensorsEqual(x, y)
    08:34:39 File "/var/lib/jenkins/workspace/test/common_utils.py", line 778, in assertTensorsEqual
    08:34:39 self.assertLessEqual(max_err, prec, message)
    08:34:39 AssertionError: tensor(1, device='cuda:0', dtype=torch.int32) not less than or equal to 1e-05 :

[test all]
Resubmit of #30020, which was reverted.
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.


Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Nov 27, 2019
Pull Request resolved: #30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.

ghstack-source-id: 94673884

Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)
@rohan-varma
Contributor Author

rohan-varma added a commit that referenced this pull request Dec 4, 2019
Summary:
Pull Request resolved: #30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.

ghstack-source-id: 94673884

Test Plan: Unit tests pass.

Reviewed By: mrshenli

Differential Revision: D18661775

fbshipit-source-id: 5aaa7c14603e18253394224994f6cd43234301c2
rohan-varma added a commit that referenced this pull request Dec 5, 2019
Summary:
Pull Request resolved: #30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.

ghstack-source-id: 94673884

Test Plan: Unit tests pass.

Reviewed By: mrshenli

Differential Revision: D18661775

fbshipit-source-id: 5aaa7c14603e18253394224994f6cd43234301c2
@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/41/head branch December 10, 2019 15:20
facebook-github-bot pushed a commit that referenced this pull request Dec 19, 2019
Summary:
#30330 got rid of the need to send a `MessageType::SHUTDOWN` message, so we can now remove the logic/utils for this type of message.

I think we can also delete the enum entry in the `enum MessageType`, but we may want to keep it in case the logic in #30710 is ever moved to C++.
Pull Request resolved: #31270

Test Plan: All existing unit tests pass

Differential Revision: D19146983

Pulled By: rohan-varma

fbshipit-source-id: 35b185411f9446d7d4dfc37a6cb5477cf041e647
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
Summary:
Pull Request resolved: pytorch#30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.

ghstack-source-id: 94673884

Test Plan: Unit tests pass.

Reviewed By: mrshenli

Differential Revision: D18661775

fbshipit-source-id: 5aaa7c14603e18253394224994f6cd43234301c2
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
Summary:
pytorch#30330 got rid of the need to send a `MessageType::SHUTDOWN` message, so we can now remove the logic/utils for this type of message.

I think we can also delete the enum entry in the `enum MessageType`, but we may want to keep it in case the logic in pytorch#30710 is ever moved to C++.
Pull Request resolved: pytorch#31270

Test Plan: All existing unit tests pass

Differential Revision: D19146983

Pulled By: rohan-varma

fbshipit-source-id: 35b185411f9446d7d4dfc37a6cb5477cf041e647
rohan-varma added a commit that referenced this pull request Apr 7, 2020
…() in ProcessGroupAgent::listenLoop"

ungraceful shutdown

#30330 added support to abort the call to a `RecvWork` created by `recvAnysource` but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support to be able to abort this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown.

Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens.

Differential Revision: [D20632764](https://our.internmc.facebook.com/intern/diff/D20632764/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D20632764/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Apr 7, 2020
…ssGroupAgent::listenLoop

Pull Request resolved: #36084

#30330 added support to abort the call to a `RecvWork` created by `recvAnysource`, but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support to be able to abort this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown.

Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens.
ghstack-source-id: 101645227

Differential Revision: [D20632764](https://our.internmc.facebook.com/intern/diff/D20632764/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D20632764/)!
rohan-varma added a commit that referenced this pull request Apr 7, 2020
…ssGroupAgent::listenLoop

Pull Request resolved: #36084

#30330 added support to abort the call to a `RecvWork` created by `recvAnysource`, but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support to be able to abort this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown.

Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens.
ghstack-source-id: 101689402

Differential Revision: [D20632764](https://our.internmc.facebook.com/intern/diff/D20632764/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D20632764/)!
facebook-github-bot pushed a commit that referenced this pull request Apr 8, 2020
…ssGroupAgent::listenLoop (#36084)

Summary:
Pull Request resolved: #36084

#30330 added support to abort the call to a `RecvWork` created by `recvAnysource`, but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support to be able to abort this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown.

Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens.
ghstack-source-id: 101689402

Test Plan:
Added test in ProcessGroupAgentTest. We also add a basic config that allows us to control whether to abort the call to `pg->recv()` and `pg->recvAnysource()` in `FailingWaitProcessGroupGloo`.

Run test binary:
```buck build mode/dev-nosan //caffe2/torch/fb/distributed/thriftRpcBackend/test:ProcessGroupAgentTest --keep-going
~/fbcode/buck-out/gen/caffe2/torch/fb/distributed/thriftRpcBackend/test/ProcessGroupAgentTest
```
P128567144

Differential Revision: D20632764

fbshipit-source-id: c0b3c391fd3e0ae711661ad99f309ee4d93f6582
ashishfarmer pushed a commit to ashishfarmer/pytorch that referenced this pull request Apr 13, 2020
…ssGroupAgent::listenLoop (pytorch#36084)

Summary:
Pull Request resolved: pytorch#36084

pytorch#30330 added support to abort the call to a `RecvWork` created by `recvAnysource`, but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support to be able to abort this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown.

Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens.
ghstack-source-id: 101689402

Test Plan:
Added test in ProcessGroupAgentTest. We also add a basic config that allows us to control whether to abort the call to `pg->recv()` and `pg->recvAnysource()` in `FailingWaitProcessGroupGloo`.

Run test binary:
```buck build mode/dev-nosan //caffe2/torch/fb/distributed/thriftRpcBackend/test:ProcessGroupAgentTest --keep-going
~/fbcode/buck-out/gen/caffe2/torch/fb/distributed/thriftRpcBackend/test/ProcessGroupAgentTest
```
P128567144

Differential Revision: D20632764

fbshipit-source-id: c0b3c391fd3e0ae711661ad99f309ee4d93f6582