Skip to content

Conversation

@rohan-varma
Copy link
Contributor

Summary:
Closes #227.
Adds 2 methods: abortWaitSend, abortWaitRecv so that we can interrupt a blocking waitRecv or waitSend, per pietern 's suggestion. This is useful if a gloo process is waiting for a message in one thread, but another thread decides to shutdown and wants to stop it gracefully.
The methods set a boolean variable and notify a waiting condition variable. The condition variable predicate is changed to check this condition. The implementation is both for the tcp and uv transport methods. Unit tests are also added.

Differential Revision: D18321454

@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D18321454

rohan-varma added a commit to rohan-varma/gloo that referenced this pull request Nov 5, 2019
Summary:
Pull Request resolved: pytorch#232

Closes pytorch#227.
Adds 2 methods: abortWaitSend, abortWaitRecv so that we can interrupt a blocking waitRecv or waitSend, per pietern 's suggestion. This is useful if a gloo process is waiting for a message in one thread, but another thread decides to shutdown and wants to stop it gracefully.
The methods set a boolean variable and notify a waiting condition variable. The condition variable predicate is changed to check this condition. The implementation is both for the tcp and uv transport methods. Unit tests are also added.

Differential Revision: D18321454

fbshipit-source-id: fb75eedcbec46c2878e43209170f72ef6b48ce70
@xush6528
Copy link

xush6528 commented Nov 5, 2019

terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /root/project/gloo/transport/uv/pair.cc:234] op.buf. Cannot lock pointer to unbound buffer
Received 'aborted' signal

Error is relevant.

@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D18321454

Summary:
Pull Request resolved: pytorch#232

Closes pytorch#227.
Adds 2 methods: abortWaitSend, abortWaitRecv so that we can interrupt a blocking waitRecv or waitSend, per pietern 's suggestion. This is useful if a gloo process is waiting for a message in one thread, but another thread decides to shutdown and wants to stop it gracefully.
The methods set a boolean variable and notify a waiting condition variable. The condition variable predicate is changed to check this condition. The implementation is both for the tcp and uv transport methods. Unit tests are also added.

Differential Revision: D18321454

fbshipit-source-id: 2b0b5af02cf266fd49a4b78d722ff7f2be47bd15
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D18321454

@rohan-varma
Copy link
Contributor Author

terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /root/project/gloo/transport/uv/pair.cc:234] op.buf. Cannot lock pointer to unbound buffer
Received 'aborted' signal

Error is relevant.

It looks like this has been fixed by adding buf->waitRecv(); to the AbortRecv test.

@rohan-varma rohan-varma requested a review from pietern November 5, 2019 19:54
Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @rohan-varma. LGTM.

@facebook-github-bot
Copy link

This pull request has been merged in 7c54124.

facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Nov 6, 2019
Summary:
Update gloo submodule to use the new APIs introduced in pytorch/gloo#232. Done by `cd third_party/gloo && git checkout 7c54124` which is gloo's latest commit.

Next step would be to consume the introduced APIs in `ProcessGroup::Work`. Then we can use this layer to be able to interrupt `ProcessGroupAgent` (only for the gloo backend).
Pull Request resolved: #29248

Reviewed By: xush6528

Differential Revision: D18350654

Pulled By: rohan-varma

fbshipit-source-id: e41f7446bbb500087a0ca3919173b2e8379c7ce7
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 18, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 18, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 18, 2019
…gent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 19, 2019
…gent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 19, 2019
…gent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 19, 2019
…gent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 19, 2019
…gent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 19, 2019
…gent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 19, 2019
…gent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 20, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 20, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 20, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 20, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 20, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 20, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 21, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 21, 2019
…agent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 21, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 21, 2019
…agent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 21, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 21, 2019
…agent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 21, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 21, 2019
…agent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 21, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 21, 2019
…agent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 21, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 22, 2019
…agent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 22, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 22, 2019
…agent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 22, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 22, 2019
…agent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 22, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 22, 2019
…agent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 22, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 22, 2019
…agent"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 22, 2019
After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `localShutdown` method aborts this thread and joins the other threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid statem.

A destructor is also added which simply calls `localShutdown`, and we ensure that this function is not called multiple  times.

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 22, 2019
…up agent and refactor join"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `shutdown` method aborts this and joins the all the threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid state. The new way of shutting down is:

```
rpc.wait_all_workers()
rpc.shutdown()
```

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
rohan-varma added a commit to pytorch/pytorch that referenced this pull request Nov 22, 2019
…actor join"


After changes in the previous PR (#29928) and the gloo code (pytorch/gloo#232), we can now interrupt the listener thread in process group agent, which was blocked on `wait`ing for some work to be received. The `shutdown` method aborts this and joins the all the threads. A condition variable is also added to ensure that we don't abort the thread while `recvWork` is in an invalid state. The new way of shutting down is:

```
rpc.wait_all_workers()
rpc.shutdown()
```

Differential Revision: [D5578006](https://our.internmc.facebook.com/intern/diff/D5578006/)

[ghstack-poisoned]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Interrupt waitSend/waitRecv

4 participants