[resubmit][rpc] Add local shutdown to process group agent #30330
Conversation
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times. ghstack-source-id: 94442415 Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/) [ghstack-poisoned]
@rohan-varma and @mrshenli I'm wondering why we need to split into `wait_all_workers()` and `shutdown()` in this way, if `shutdown()` always needs to be preceded by `wait_all_workers()`? It looks like they are not independent.

One reason is that
Resubmit of #30020, which was reverted. This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times. Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/) [ghstack-poisoned]
```cpp
      recvWork_->abort();
    }
  }
  threadPool_.waitWorkComplete();
```
I had to add this here in shutdown; otherwise the test `test_rpc_shutdown()` would flake with gloo connection reset errors. I don't think it delays shutdown by much, as it just waits for already-enqueued work to complete.
hmm, does this need to be placed before the abort? Otherwise, if there are indeed unfinished tasks, the receiving end's listener thread might have already been aborted?
Hm, my reasoning for doing it after was that if it came before, we could possibly receive more work while aborting the listener thread, enqueue that recv, and then not wait for it to complete. For example, if `recvWork_->wait()` has already been unblocked before we call its `abort()` (via a message from another worker), we may have called `enqueueRecv`, which will add a task to the thread pool.
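To make the ordering concrete, here is a toy model of the argument above (all names — `ToyAgent`, `aborted`, `enqueued` — are hypothetical stand-ins, not the actual `ProcessGroupAgent` code): once the listener is joined, no new work can be enqueued, so the thread-pool drain sees a stable set of tasks.

```cpp
#include <atomic>
#include <thread>

// Toy model of the shutdown ordering discussed above. 'aborted' stands in
// for recvWork_->abort(), 'listener' for listenerThread_, and the drain
// loop for threadPool_.waitWorkComplete().
struct ToyAgent {
  std::atomic<bool> aborted{false};
  std::atomic<int> enqueued{0};   // tasks the listener has handed off
  std::atomic<int> completed{0};  // tasks the "pool" has finished
  std::thread listener;

  void start() {
    // The listener keeps enqueueing work until it is aborted,
    // mirroring enqueueRecv() in the listen loop.
    listener = std::thread([this] {
      while (!aborted.load()) {
        enqueued.fetch_add(1);
      }
    });
  }

  void shutdown() {
    aborted.store(true);  // models recvWork_->abort()
    listener.join();      // past this point, no new work can arrive
    // Models threadPool_.waitWorkComplete(): because the listener is
    // already joined, 'enqueued' is stable and the drain terminates.
    while (completed.load() < enqueued.load()) {
      completed.fetch_add(1);
    }
  }
};
```

Swapping the last two steps would reintroduce the race described above: the drain could finish while the still-running listener enqueues one more recv that nobody waits for.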
lgtm!
torch/distributed/rpc/api.py
Outdated
```python
r"""
Block until all local and remote RPC processes reach this method, and then
destroy local the RPC agent. Every RPC process must call this method before
destroy RRef and RPC handlers. Every RPC process must call this method before
```
"RPC handlers" -> "Python RPC handlers"
RPC handlers is vague.
This docstring is also in the wrong place now, will update. I actually don't think we need to mention that we destroy RPC handlers and RRef context, those are internal details that should be abstracted away from the user.
> I actually don't think we need to mention that we destroy RPC handlers and RRef context, those are internal details that should be abstracted away from the user.
Agreed, users don't need to know there is an RRef context and handlers.
Pull Request resolved: #30330 This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times. ghstack-source-id: 94468192 Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)
```python
>>> rpc.wait_all_workers()
>>> rpc.shutdown()
"""
global _agent
```
(This can come in a follow-up PR.)
We might not need this `global _agent` anymore, as we are not modifying it.
mrshenli
left a comment
Approve to unblock. Please wait for all tests to pass.
[test all] Resubmit of #30020, which was reverted. This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times. Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/) [ghstack-poisoned]
Pull Request resolved: #30330 This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times. ghstack-source-id: 94480867 Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)
```cpp
  }
  threadPool_.waitWorkComplete();
  listenerThread_.join();
  futureTimeoutCV_.notify_one();
```
What if `futureTimeoutThread_` is not waiting on `futureTimeoutCV_`? For example, it could be at the line `std::chrono::milliseconds sleepTime;`.
Will `futureTimeoutThread_` miss the `notify_one()` and wait on the condition variable forever?
I am afraid the shutdown test case will be flaky, and some failure instances will be timeouts.
I found a tutorial discussing how to use a predicate to avoid losing a notification: https://www.modernescpp.com/index.php/c-core-guidelines-be-aware-of-the-traps-of-condition-variables
Especially the second example, "An atomic predicate", is interesting. It shows that even when we do `rpcRunning_.store(true)`, we should acquire the lock that is used by `futureTimeoutCV_`.
Thanks for pointing this out - went ahead and made the changes to hold the lock as you suggested.
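The fix can be sketched as follows; this is a minimal standalone illustration of the predicate pattern from the linked article, not the actual `ProcessGroupAgent` code (`running` stands in for `rpcRunning_`, `cv` for `futureTimeoutCV_`):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex mtx;              // the same mutex the condition variable uses
std::condition_variable cv;  // stand-in for futureTimeoutCV_
bool running = true;         // stand-in for rpcRunning_, guarded by mtx

// The waiting thread re-checks the predicate on every wakeup, so a
// notification that fires before the thread reaches wait() is not lost.
void waitUntilStopped() {
  std::unique_lock<std::mutex> lock(mtx);
  cv.wait(lock, [] { return !running; });
}

// The flag is flipped while holding the lock: this closes the window
// between the waiter evaluating the predicate and going to sleep.
void stop() {
  {
    std::lock_guard<std::mutex> guard(mtx);
    running = false;
  }
  cv.notify_one();
}

// Demonstration: start a waiter, stop it, and report whether it returned.
bool demo() {
  std::thread waiter(waitUntilStopped);
  stop();
  waiter.join();  // returns even if stop() won the race to the mutex
  return !running;
}
```

Even if `stop()` runs entirely before the waiter reaches `cv.wait()`, the predicate is already false on entry and the waiter returns immediately instead of blocking forever.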
```cpp
}

void ProcessGroupAgent::start() {
  rpcRunning_.store(true);
```
mrshenli
left a comment
Remove stamp until we address @xush6528's comment
Pull Request resolved: #30330 This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times. ghstack-source-id: 94490200 Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)
Pull Request resolved: #30330 This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately. ghstack-source-id: 94642022 Differential Revision: [D18661775](https://our.internmc.facebook.com/intern/diff/D18661775/)
```python
# Run world_size workers.
world_size = 2
for i in range(world_size):
    p = mp.Process(target=run_process, args=(i, (i + 1) % 2, world_size))
```
mp.spawn for the win!
Addressing in separate PR (#30381)
```python
def _check_rpc_done(rank_distance):
    while not rpc_done[rank_distance]:
        time.sleep(0)
```
Could add a comment here saying that this yields control to other threads, so that the call isn't removed later.
Will address in a follow up PR.
I ended up addressing it in this PR, since I had to do a rebase anyways.
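For readers coming from the C++ side, the `time.sleep(0)` idiom above has a direct counterpart; this is a hypothetical sketch (not PyTorch code), where `rpc_done` stands in for the Python `rpc_done[rank_distance]` flag:

```cpp
#include <atomic>
#include <thread>

// sleep(0) in the Python test plays the role of std::this_thread::yield():
// hand the CPU to other runnable threads while polling a flag that is set
// by another thread.
std::atomic<bool> rpc_done{false};  // stand-in for rpc_done[rank_distance]

void checkRpcDone() {
  while (!rpc_done.load()) {
    std::this_thread::yield();  // cooperative: don't starve the setter
  }
}
```

Without the yield, the polling loop would spin at full speed and could delay the very thread it is waiting on, which is exactly why the `sleep(0)` should not be removed.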
```cpp
// ProcessGroupAgent::start and unset in ProcessGroupAgent::shutdown and
// ProcessGroupAgent::join. It controls whether several background threads
// should be running.
std::atomic<bool> rpcRunning_{false};
```
Sounds good, but please document what the expected locking strategy is.
Using an atomic implies (at least to me) that you don't need locks. To reduce the probability of this causing breakage later on, I think a regular bool is better. Accessing that without locks is a red flag in code review.
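One way to follow that advice is to wrap the plain bool so that every access visibly takes the same mutex; this is a sketch of the suggestion, not the code under review, and the class name is hypothetical:

```cpp
#include <mutex>

// Hypothetical wrapper for a run flag: a plain bool whose locking strategy
// is enforced by construction rather than documented in a comment. The
// mutex here would be the same one paired with any condition variable
// (e.g. futureTimeoutCV_) that waits on the flag.
class RunningFlag {
 public:
  void set(bool value) {
    std::lock_guard<std::mutex> guard(mutex_);
    running_ = value;
  }
  bool get() {
    std::lock_guard<std::mutex> guard(mutex_);
    return running_;
  }

 private:
  std::mutex mutex_;  // guards running_
  bool running_{false};
};
```

With this shape, a reviewer never has to ask whether a lock-free read of the flag is intentional: there is no way to read it without the lock.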
The following are the CI errors, all unrelated:
The CI failure that is not listed as "broken upstream" has occurred on other merged PRs (https://app.circleci.com/jobs/github/pytorch/pytorch/3733973, https://app.circleci.com/jobs/github/pytorch/pytorch/3734258); it should be fixed by this revert: a2ed50c
Summary: Pull Request resolved: #30330 This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately. ghstack-source-id: 94673884 Test Plan: Unit tests pass. Reviewed By: mrshenli Differential Revision: D18661775 fbshipit-source-id: 5aaa7c14603e18253394224994f6cd43234301c2
Summary: #30330 got rid of the need to send a `MessageType::SHUTDOWN` message, so we can now remove the logic/utils for this type of message. I think we can also delete the enum entry in the `enum MessageType`, but we may want to keep it in case the logic in #30710 is ever moved to C++. Pull Request resolved: #31270 Test Plan: All existing unit tests pass Differential Revision: D19146983 Pulled By: rohan-varma fbshipit-source-id: 35b185411f9446d7d4dfc37a6cb5477cf041e647
…ssGroupAgent::listenLoop Pull Request resolved: #36084 #30330 added support to abort the call to a `RecvWork` created by `recvAnysource`, but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support to be able to abort this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown. Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens. ghstack-source-id: 101645227 Differential Revision: [D20632764](https://our.internmc.facebook.com/intern/diff/D20632764/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D20632764/)!
…ssGroupAgent::listenLoop (#36084) Summary: Pull Request resolved: #36084 #30330 added support to abort the call to a `RecvWork` created by `recvAnysource`, but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support to be able to abort this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown. Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens. ghstack-source-id: 101689402 Test Plan: Added test in ProcessGroupAgentTest. We also add a basic config that allows us to control whether to abort the call to `pg->recv()` and `pg->recvAnysource()` in `FailingWaitProcessGroupGloo`. Run test binary: ```buck build mode/dev-nosan //caffe2/torch/fb/distributed/thriftRpcBackend/test:ProcessGroupAgentTest --keep-going ~/fbcode/buck-out/gen/caffe2/torch/fb/distributed/thriftRpcBackend/test/ProcessGroupAgentTest ``` P128567144 Differential Revision: D20632764 fbshipit-source-id: c0b3c391fd3e0ae711661ad99f309ee4d93f6582
Stack from ghstack:
[test all]
Resubmit of #30020 ([rpc] Add shutdown function to process group agent and refactor join), which was reverted.
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.
Differential Revision: D18661775