-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Properly shutdown RPC even in the case of clean_shutdown=False.
#29148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Properly shutdown RPC even in the case of clean_shutdown=False.
#29148
Conversation
We would skip rpc.join_rpc() in the case of `clean_shutdown=False`. This would exit the process without properly cleaning up the local RPCAgent resulting in a crash. As a result, to fix this we still call rpc.join_rpc() even in an unclean shutdown. Note that, rpc.join_rpc() needs to be replaced with a local `shutdown` call eventually since we need a way to shutdown the local RPC agent properly. Differential Revision: [D18306941](https://our.internmc.facebook.com/intern/diff/D18306941/) [ghstack-poisoned]
We would skip rpc.join_rpc() in the case of `clean_shutdown=False`. This would exit the process without properly cleaning up the local RPCAgent resulting in a crash. As a result, to fix this we still call rpc.join_rpc() even in an unclean shutdown. Note that, rpc.join_rpc() needs to be replaced with a local `shutdown` call eventually since we need a way to shutdown the local RPC agent properly. Differential Revision: [D18306941](https://our.internmc.facebook.com/intern/diff/D18306941/) ghstack-source-id: 93216133 Pull Request resolved: #29148
This should be unblocked once pytorch/gloo#228 is landed. |
…False`." We would skip rpc.join_rpc() in the case of `clean_shutdown=False`. This would exit the process without properly cleaning up the local RPCAgent resulting in a crash. As a result, to fix this we still call rpc.join_rpc() even in an unclean shutdown. Note that, rpc.join_rpc() needs to be replaced with a local `shutdown` call eventually since we need a way to shutdown the local RPC agent properly. Differential Revision: [D18306941](https://our.internmc.facebook.com/intern/diff/D18306941/) [ghstack-poisoned]
Pull Request resolved: #29148 We would skip rpc.join_rpc() in the case of `clean_shutdown=False`. This would exit the process without properly cleaning up the local RPCAgent resulting in a crash. As a result, to fix this we still call rpc.join_rpc() even in an unclean shutdown. Note that, rpc.join_rpc() needs to be replaced with a local `shutdown` call eventually since we need a way to shutdown the local RPC agent properly. Differential Revision: [D18306941](https://our.internmc.facebook.com/intern/diff/D18306941/) ghstack-source-id: 93561508
| # since we need to shutdown the RPC agent. If we don't shutdown the | ||
| # RPC agent, tests would fail since RPC agent threads, locks and | ||
| # condition variables are not properly terminated. | ||
| rpc.join_rpc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, replace this with the local shutdown call after the shutdown API is ready.
| fut = rpc.rpc_async(dst_name, set_termination_signal, args=()) | ||
| futs.append(fut) | ||
| for fut in futs: | ||
| assert fut.wait() is None, "Sending termination signal failed." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And replace this part with the rpc.all_shutdown() API to be added.
|
This pull request has been merged in 310343e. |
Stack from ghstack:
clean_shutdown=False. #29148 Properly shutdown RPC even in the case ofclean_shutdown=False.We would skip rpc.join_rpc() in the case of
clean_shutdown=False.This would exit the process without properly cleaning up the local RPCAgent
resulting in a crash.
As a result, to fix this we still call rpc.join_rpc() even in an unclean
shutdown. Note that, rpc.join_rpc() needs to be replaced with a local
shutdowncall eventually since we need a way to shutdown the local RPC agentproperly.
Differential Revision: D18306941