[rpc] allow ability to abort second call to RecvWork::wait() in ProcessGroupAgent::listenLoop #36084
This pull request has been merged in fc5d658.
…ssGroupAgent::listenLoop (pytorch#36084)

Summary: Pull Request resolved: pytorch#36084

pytorch#30330 added support to abort the call to a `RecvWork` created by `recvAnysource`, but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support to be able to abort this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown.

Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens.

ghstack-source-id: 101689402

Test Plan: Added test in `ProcessGroupAgentTest`. We also add a basic config that allows us to control whether to abort the call to `pg->recv()` and `pg->recvAnysource()` in `FailingWaitProcessGroupGloo`.

Run test binary:

```
buck build mode/dev-nosan //caffe2/torch/fb/distributed/thriftRpcBackend/test:ProcessGroupAgentTest --keep-going
~/fbcode/buck-out/gen/caffe2/torch/fb/distributed/thriftRpcBackend/test/ProcessGroupAgentTest
```

P128567144

Differential Revision: D20632764

fbshipit-source-id: c0b3c391fd3e0ae711661ad99f309ee4d93f6582
#30330 added support to abort the call to a `RecvWork` created by `recvAnysource`, but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support to be able to abort this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown.

Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens.

Differential Revision: [D20632764](https://our.internmc.facebook.com/intern/diff/D20632764/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D20632764/)!
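
The two-step receive described above (a preamble, then the tensor payload) and the shutdown-time abort can be illustrated with a small standalone sketch. This is not the `ProcessGroupAgent` or c10d code: `AbortableWork`, `Listener`, `listenOnce`, `runShutdownDemo`, and the error string are all hypothetical names chosen for illustration, and the "receives" are simulated with a condition variable rather than a real process group.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical error text; the real message in the PR is not shown in the source.
const char* const kAbortMessage = "receive aborted during shutdown";

// Hypothetical stand-in for a pending receive (not the c10d RecvWork API).
class AbortableWork {
 public:
  // Blocks until complete() or abort() is called; throws if it never completed.
  void wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return completed_ || aborted_; });
    if (!completed_) {
      throw std::runtime_error(kAbortMessage);
    }
  }
  void complete() { finish(completed_); }
  void abort() { finish(aborted_); }

 private:
  void finish(bool& flag) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      flag = true;
    }
    cv_.notify_all();
  }
  std::mutex mutex_;
  std::condition_variable cv_;
  bool completed_ = false;
  bool aborted_ = false;
};

// Hypothetical listener that tracks whichever receive is currently in flight,
// so that shutdown() can abort the second (payload) wait as well as the first.
class Listener {
 public:
  // Returns "" on success, or the error message if a wait was aborted.
  std::string listenOnce() {
    try {
      auto preamble = std::make_shared<AbortableWork>();
      track(preamble);
      preamble->complete();  // simulate: the preamble arrives immediately
      preamble->wait();

      auto payload = std::make_shared<AbortableWork>();
      track(payload);
      payload->wait();  // simulate: the peer died, the payload never arrives
      return "";
    } catch (const std::runtime_error& e) {
      return e.what();
    }
  }

  void shutdown() {
    std::shared_ptr<AbortableWork> work;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      shuttingDown_ = true;
      work = pending_;
    }
    if (work) {
      work->abort();
    }
  }

 private:
  // Records the in-flight work; aborts it at once if shutdown already began,
  // which closes the window between the first and second wait.
  void track(const std::shared_ptr<AbortableWork>& work) {
    assert(work != nullptr);
    bool abortNow = false;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pending_ = work;
      abortNow = shuttingDown_;
    }
    if (abortNow) {
      work->abort();
    }
  }
  std::mutex mutex_;
  std::shared_ptr<AbortableWork> pending_;
  bool shuttingDown_ = false;
};

// Runs the listener on a thread, then shuts down while the payload is pending.
std::string runShutdownDemo() {
  Listener listener;
  std::string result;
  std::thread worker([&] { result = listener.listenOnce(); });
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  listener.shutdown();
  worker.join();
  return result;
}
```

In this sketch, calling `runShutdownDemo()` returns the abort message instead of hanging, because `shutdown()` wakes the wait on the payload rather than only the wait on the preamble. Tracking a single "current" work item is a simplification made here; how the actual agent stores and aborts its pending work is not described in this page.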