Conversation

Contributor

@rohan-varma rohan-varma commented Nov 22, 2019

Stack from ghstack:

Differential Revision: D18661775

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.
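The "not called multiple times" guarantee amounts to an idempotent shutdown path. A minimal sketch of that guard, in Python for brevity (the `Agent`/`local_shutdown` names are illustrative analogues of the C++ `localShutdown`, not the actual implementation):

```python
import threading

class Agent:
    """Toy analogue of an RPC agent whose destructor and explicit
    shutdown() both funnel into one idempotent local_shutdown()."""

    def __init__(self):
        self._shutdown_lock = threading.Lock()
        self._shut_down = False
        self.shutdown_count = 0  # instrumentation for this example

    def local_shutdown(self):
        # Run the teardown at most once, even when reached from both
        # shutdown() and __del__ (mirroring the destructor path).
        with self._shutdown_lock:
            if self._shut_down:
                return
            self._shut_down = True
        self.shutdown_count += 1  # abort listener, join threads, etc.

    def shutdown(self):
        self.local_shutdown()

    def __del__(self):
        self.local_shutdown()

agent = Agent()
agent.shutdown()
agent.local_shutdown()  # second call is a no-op
print(agent.shutdown_count)  # prints 1
```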

ghstack-source-id: 94442415

Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)

[ghstack-poisoned]
@zhaojuanmao
Contributor

@rohan-varma and @mrshenli I'm wondering why we need to split this into wait_all_workers() and shutdown() if shutdown() always needs to follow wait_all_workers()? It looks like they are not independent.

@mrshenli
Contributor

@zhaojuanmao

One reason is that wait_all_workers could become a non-terminating API in the future, something similar to dist.barrier(). Applications could then do some rpc/remote work, call wait_all_workers for a global sync, resume, perhaps do another global sync, and finally shutdown (as mentioned by @satgera). We don't explicitly say we support this yet, but we can discuss if that's what we want.
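The flow described above (work, global sync, more work, final sync, shutdown) can be sketched with plain threads, using `threading.Barrier` as a stand-in for a barrier-style wait_all_workers(); this is an illustrative analogue, not the RPC API itself:

```python
import threading

world_size = 3
sync = threading.Barrier(world_size)   # stand-in for wait_all_workers()
results = []
results_lock = threading.Lock()

def worker(rank):
    # Round 1 of rpc/remote work.
    with results_lock:
        results.append((1, rank))
    sync.wait()   # global sync: every peer finished round 1
    # Round 2, safe to rely on round-1 side effects everywhere.
    with results_lock:
        results.append((2, rank))
    sync.wait()   # final sync before a hypothetical shutdown()

threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The barrier guarantees all round-1 entries precede all round-2 entries.
rounds = [r for r, _ in results]
assert rounds == sorted(rounds)
print("ok")
```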

Resubmit of #30020, which was reverted.
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.


Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)

[ghstack-poisoned]
recvWork_->abort();
}
}
threadPool_.waitWorkComplete();
Contributor Author

I had to add this here in shutdown; otherwise the test test_rpc_shutdown() would flake with gloo connection reset errors. I don't think it delays shutdown by much, since it just waits for already-enqueued work to complete.

Contributor

hmm, does this need to be placed before abort? Otherwise, if there are indeed unfinished tasks, the receiving end's listener thread might have already been aborted?

Contributor Author

Hm, my reasoning for doing it after was that if it were before, we could receive more work while aborting the listener thread, enqueue that recv, and then never wait for it to complete. For example, say that recvWork->wait has already been unblocked (via a message from another worker) before we call its abort; we may then have called enqueueRecv, which adds a task to the thread pool.
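The ordering argument can be played out with a toy listener/thread-pool pair: stop the producer first, then drain the queue, so a recv enqueued just before the abort is still waited on. A loose Python analogue (all names hypothetical):

```python
import queue
import threading

pool = queue.Queue()
done = []

def pool_worker():
    # Thread-pool stand-in: process tasks until the shutdown sentinel.
    while True:
        task = pool.get()
        if task is None:
            break
        done.append(task)

w = threading.Thread(target=pool_worker)
w.start()

# The "listener" enqueues one last recv right before it is aborted.
pool.put("late-recv")

# shutdown(): abort the listener first, so no new work can arrive...
pool.put(None)  # sentinel standing in for abort()
# ...then wait for already-enqueued work, mirroring waitWorkComplete().
w.join()
assert done == ["late-recv"]
print("drained")
```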

@zhaojuanmao
Contributor

lgtm!

r"""
Block until all local and remote RPC processes reach this method, and then
destroy the local RPC agent, RRef, and RPC handlers. Every RPC process must
call this method before
Contributor

"RPC handlers" -> "Python RPC handlers"

RPC handlers is vague.

Contributor Author

@rohan-varma rohan-varma Nov 23, 2019

This docstring is also in the wrong place now, will update. I actually don't think we need to mention that we destroy RPC handlers and RRef context, those are internal details that should be abstracted away from the user.

Contributor

> I actually don't think we need to mention that we destroy RPC handlers and RRef context, those are internal details that should be abstracted away from the user.

Agree, users don't need to know there is an RRef context and handler.

Resubmit of #30020, which was reverted.
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.


Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Nov 23, 2019
Pull Request resolved: #30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.

ghstack-source-id: 94468192

Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)
@rohan-varma rohan-varma changed the title [resubmit][rpc] Add local shutdown to process group agent [test all][resubmit][rpc] Add local shutdown to process group agent Nov 23, 2019
>>> rpc.wait_all_workers()
>>> rpc.shutdown()
"""
global _agent
Contributor

(This can come in a follow-up PR)

We might not need this global _agent any more, as we are not modifying it.


Contributor

@mrshenli mrshenli left a comment

Approve to unblock. Please wait for all tests to pass.

…oup agent"

[test all]
Resubmit of #30020, which was reverted.
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.


Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Nov 23, 2019
Pull Request resolved: #30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.

ghstack-source-id: 94480867

Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)
@rohan-varma rohan-varma changed the title [test all][resubmit][rpc] Add local shutdown to process group agent [resubmit][rpc] Add local shutdown to process group agent Nov 23, 2019
}
threadPool_.waitWorkComplete();
listenerThread_.join();
futureTimeoutCV_.notify_one();
Contributor

@xush6528 xush6528 Nov 24, 2019

What if futureTimeoutThread_ is not waiting on futureTimeoutCV_? For example, it could be at the line `std::chrono::milliseconds sleepTime;`.

Will futureTimeoutThread_ miss the notify_one() and wait on the condition variable forever?

I am afraid the shutdown test case will be flaky, and some failure instances will be timeouts.

I found a tutorial discussing how to use a predicate to avoid losing notifications: https://www.modernescpp.com/index.php/c-core-guidelines-be-aware-of-the-traps-of-condition-variables

The second example, "An atomic predicate", is especially interesting. It shows that even when we do `rpcRunning_.store(true)`, we should acquire the lock that is used by futureTimeoutCV_.

Contributor Author

@rohan-varma rohan-varma Nov 24, 2019

Thanks for pointing this out! I went ahead and made the changes to hold the lock as you suggested.
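The pattern being discussed (set the flag while holding the mutex that the condition variable uses, and wait with a predicate) can be sketched in Python; `rpc_running` here is a stand-in for `rpcRunning_`, and the lock for `futureMutex_`:

```python
import threading
import time

lock = threading.Lock()
cv = threading.Condition(lock)
rpc_running = True   # the predicate guarded by `lock`
woke = []

def future_timeout_thread():
    with cv:
        # wait_for() re-checks the predicate on every wakeup, so a
        # notification can never be lost between the check and the wait.
        cv.wait_for(lambda: not rpc_running)
    woke.append(True)

t = threading.Thread(target=future_timeout_thread)
t.start()
time.sleep(0.05)

# shutdown(): flip the flag *while holding the same lock* before notifying;
# otherwise the waiter could observe rpc_running == True, then block in
# wait() just after our notify fires, and sleep forever.
with cv:
    rpc_running = False
    cv.notify()

t.join(timeout=5)
assert woke == [True]
print("no lost wakeup")
```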

}

void ProcessGroupAgent::start() {
rpcRunning_.store(true);
Contributor

@xush6528 xush6528 Nov 24, 2019

I changed the futureTimeoutCV_.wait_for(...) part in #30355.

We need to change this line to

{
  std::lock_guard<std::mutex> futureLock{futureMutex_};
  rpcRunning_.store(true);
}

in #30355 as well.

Contributor

@mrshenli mrshenli left a comment

Remove stamp until we address @xush6528's comment

[test all]
Resubmit of #30020, which was reverted.
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.


Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Nov 24, 2019
Pull Request resolved: #30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.

ghstack-source-id: 94490200

Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)
rohan-varma added a commit that referenced this pull request Nov 27, 2019
Pull Request resolved: #30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.

ghstack-source-id: 94642022

Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)
# Run world_size workers.
world_size = 2
for i in range(world_size):
p = mp.Process(target=run_process, args=(i, (i + 1) % 2, world_size))
Contributor

mp.spawn for the win!

Contributor Author

Addressing in separate PR (#30381)


def _check_rpc_done(rank_distance):
while not rpc_done[rank_distance]:
time.sleep(0)
Contributor

Could you add a comment here saying that this yields control to other threads, so that it doesn't get removed later?

Contributor Author

Will address in a follow up PR.

Contributor Author

I ended up addressing it in this PR, since I had to do a rebase anyways.
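A commented version of the helper, as suggested (the `rpc_done` table and the remote side here are toy stand-ins for the test-file globals):

```python
import threading
import time

# Toy stand-in for the module-level flag table in the test file.
rpc_done = {1: False}

def _check_rpc_done(rank_distance):
    while not rpc_done[rank_distance]:
        # time.sleep(0) yields control to other threads without adding
        # latency, so don't "simplify" this to an empty loop body.
        time.sleep(0)

def remote_side():
    # Simulates the peer marking its RPC as done.
    rpc_done[1] = True

t = threading.Thread(target=remote_side)
t.start()
_check_rpc_done(1)
t.join()
print("done")
```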

// ProcessGroupAgent::start and unset in ProcessGroupAgent::shutdown and
// ProcessGroupAgent::join. It controls whether several background threads
// should be running.
std::atomic<bool> rpcRunning_{false};
Contributor

Sounds good, but please document what the expected locking strategy is.

Using an atomic implies (at least to me) that you don't need locks. To reduce the probability of this causing breakage later on, I think a regular bool is better. Accessing that without locks is a red flag in code review.

@rohan-varma
Contributor Author

The following are the CI errors, all unrelated:

  1. Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:405-3796e2c20d4587ffdcae63b42490a63316be44bd-android-arm-v7a not found

  2. Nov 27 08:55:40 FAIL [0.032s]: test_neg_cuda (main.TestTorchDeviceTypeCUDA)
    Nov 27 08:55:40 ----------------------------------------------------------------------
    Nov 27 08:55:40 RuntimeError: CUDA error: misaligned address

  3. Nov 27 00:01:04 Traceback (most recent call last):
    Nov 27 00:01:04 File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/soundfile.py", line 142, in
    Nov 27 00:01:04 raise OSError('sndfile library not found')
    Nov 27 00:01:04 OSError: sndfile library not found

  4. 08:34:39 FAIL: test_logical_or_cuda (main.TestTorchDeviceTypeCUDA)
    08:34:39 ----------------------------------------------------------------------
    08:34:39 Traceback (most recent call last):
    08:34:39 File "/var/lib/jenkins/workspace/test/common_utils.py", line 631, in wrapper
    08:34:39 method(*args, **kwargs)
    08:34:39 File "/var/lib/jenkins/workspace/test/common_device_type.py", line 179, in instantiated_test
    08:34:39 return test(self, device_arg)
    08:34:39 File "test_torch.py", line 6578, in test_logical_or
    08:34:39 self._test_logical(device, 'logical_or', [10, 0, 1, 0], [1, 0, 0, 10], [1, 0, 1, 1])
    08:34:39 File "test_torch.py", line 6555, in _test_logical
    08:34:39 self.assertEqual(expected_res.bool(), getattr(a, op)(b))
    08:34:39 File "/var/lib/jenkins/workspace/test/common_utils.py", line 808, in assertEqual
    08:34:39 assertTensorsEqual(x, y)
    08:34:39 File "/var/lib/jenkins/workspace/test/common_utils.py", line 778, in assertTensorsEqual
    08:34:39 self.assertLessEqual(max_err, prec, message)
    08:34:39 AssertionError: tensor(1, device='cuda:0', dtype=torch.int32) not less than or equal to 1e-05 :

[test all]
Resubmit of #30020, which was reverted.
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.


Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Nov 27, 2019
Pull Request resolved: #30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.

ghstack-source-id: 94673884

Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)
@rohan-varma
Contributor Author

rohan-varma added a commit that referenced this pull request Dec 4, 2019
Summary:
Pull Request resolved: #30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.

ghstack-source-id: 94673884

Test Plan: Unit tests pass.

Reviewed By: mrshenli

Differential Revision: D18661775

fbshipit-source-id: 5aaa7c14603e18253394224994f6cd43234301c2
rohan-varma added a commit that referenced this pull request Dec 5, 2019
Summary:
Pull Request resolved: #30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.

ghstack-source-id: 94673884

Test Plan: Unit tests pass.

Reviewed By: mrshenli

Differential Revision: D18661775

fbshipit-source-id: 5aaa7c14603e18253394224994f6cd43234301c2
@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/41/head branch December 10, 2019 15:20
facebook-github-bot pushed a commit that referenced this pull request Dec 19, 2019
Summary:
#30330 got rid of the need to send a `MessageType::SHUTDOWN` message, so we can now remove the logic/utils for this type of message.

I think we can also delete the enum entry in the `enum MessageType`, but we may want to keep it in case the logic in #30710 is ever moved to C++.
Pull Request resolved: #31270

Test Plan: All existing unit tests pass

Differential Revision: D19146983

Pulled By: rohan-varma

fbshipit-source-id: 35b185411f9446d7d4dfc37a6cb5477cf041e647
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
Summary:
Pull Request resolved: pytorch#30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.

ghstack-source-id: 94673884

Test Plan: Unit tests pass.

Reviewed By: mrshenli

Differential Revision: D18661775

fbshipit-source-id: 5aaa7c14603e18253394224994f6cd43234301c2
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
Summary:
pytorch#30330 got rid of the need to send a `MessageType::SHUTDOWN` message, so we can now remove the logic/utils for this type of message.

I think we can also delete the enum entry in the `enum MessageType`, but we may want to keep it in case the logic in pytorch#30710 is ever moved to C++.
Pull Request resolved: pytorch#31270

Test Plan: All existing unit tests pass

Differential Revision: D19146983

Pulled By: rohan-varma

fbshipit-source-id: 35b185411f9446d7d4dfc37a6cb5477cf041e647
rohan-varma added a commit that referenced this pull request Apr 7, 2020
…() in ProcessGroupAgent::listenLoop"

ungraceful shutdown

#30330 added support to abort the call to a `RecvWork` created by `recvAnysource` but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support to be able to abort this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown.

Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens.

Differential Revision: [D20632764](https://our.internmc.facebook.com/intern/diff/D20632764/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D20632764/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Apr 7, 2020
…ssGroupAgent::listenLoop

Pull Request resolved: #36084

#30330 added support to abort the call to a `RecvWork` created by `recvAnysource`, but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support to be able to abort this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown.

Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens.
ghstack-source-id: 101645227

Differential Revision: [D20632764](https://our.internmc.facebook.com/intern/diff/D20632764/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D20632764/)!
rohan-varma added a commit that referenced this pull request Apr 7, 2020
…ssGroupAgent::listenLoop

Pull Request resolved: #36084

#30330 added support to abort the call to a `RecvWork` created by `recvAnysource`, but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support to be able to abort this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown.

Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens.
ghstack-source-id: 101689402

Differential Revision: [D20632764](https://our.internmc.facebook.com/intern/diff/D20632764/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D20632764/)!
facebook-github-bot pushed a commit that referenced this pull request Apr 8, 2020
…ssGroupAgent::listenLoop (#36084)

Summary:
Pull Request resolved: #36084

#30330 added support to abort the call to a `RecvWork` created by `recvAnysource`, but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support to be able to abort this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown.

Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens.
ghstack-source-id: 101689402

Test Plan:
Added test in ProcessGroupAgentTest. We also add a basic config that allows us to control whether to abort the call to `pg->recv()` and `pg->recvAnysource()` in `FailingWaitProcessGroupGloo`.

Run test binary:
```buck build mode/dev-nosan //caffe2/torch/fb/distributed/thriftRpcBackend/test:ProcessGroupAgentTest --keep-going
~/fbcode/buck-out/gen/caffe2/torch/fb/distributed/thriftRpcBackend/test/ProcessGroupAgentTest
```
P128567144

Differential Revision: D20632764

fbshipit-source-id: c0b3c391fd3e0ae711661ad99f309ee4d93f6582
ashishfarmer pushed a commit to ashishfarmer/pytorch that referenced this pull request Apr 13, 2020
…ssGroupAgent::listenLoop (pytorch#36084)

Summary:
Pull Request resolved: pytorch#36084

pytorch#30330 added support to abort the call to a `RecvWork` created by `recvAnysource`, but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support to be able to abort this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown.

Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens.
ghstack-source-id: 101689402

Test Plan:
Added test in ProcessGroupAgentTest. We also add a basic config that allows us to control whether to abort the call to `pg->recv()` and `pg->recvAnysource()` in `FailingWaitProcessGroupGloo`.

Run test binary:
```buck build mode/dev-nosan //caffe2/torch/fb/distributed/thriftRpcBackend/test:ProcessGroupAgentTest --keep-going
~/fbcode/buck-out/gen/caffe2/torch/fb/distributed/thriftRpcBackend/test/ProcessGroupAgentTest
```
P128567144

Differential Revision: D20632764

fbshipit-source-id: c0b3c391fd3e0ae711661ad99f309ee4d93f6582