[distributed] cleanup dist autograd context on other nodes when it is released on one node #27951
Conversation
Sends an rpc for release context Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/) [ghstack-poisoned]
…xt on other nodes" Per #25525, we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the current context and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. 4) Relevant unit tests Note: the current version is very simple and does not attempt to do cycle detection of RPCs or any retries. If `releaseContext` is called directly on one node, then that node will send RPCs to the other nodes it knows about to release their context. However, if a node receives an RPC that tells it to release its context, it will do so, but will not forward this request to other nodes that it knows about. This is to avoid cycles. The limitation of this approach is that the entire graph of nodes may not have their contexts released. In follow-up PRs, we can implement better cycle detection to solve this problem. Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/) [ghstack-poisoned]
Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 91906762 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 91918030 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
…nodes when it is released on one node" Per #25525, we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the current context and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. 4) Relevant unit tests Note: the current version is very simple and does not attempt to do cycle detection of RPCs or any retries. If `releaseContext` is called directly on one node, then that node will send RPCs to the other nodes it knows about to release their context. However, if a node receives an RPC that tells it to release its context, it will do so, but will not forward this request to other nodes that it knows about. This is to avoid cycles. The limitation of this approach is that the entire graph of nodes may not have their contexts released. In follow-up PRs, we can implement better cycle detection to solve this problem. Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/) [ghstack-poisoned]
Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 91918572 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
@pritamdamania87 I think this is ready for a first pass look now. Some legit failures (https://circleci.com/gh/pytorch/pytorch/3209870?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link) in the dist autograd tests, since the contexts are now being cleaned up across nodes, so the tests might need some more refactoring.
… when it is released on one node" Per #25525, we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in #26324) 4) Relevant unit tests Note: the current version is very simple and does not attempt to do cycle detection of RPCs or any retries. If `releaseContext` is called directly on one node, then that node will send RPCs to the other nodes it knows about to release their context. However, if a node receives an RPC that tells it to release its context, it will do so, but will not forward this request to other nodes that it knows about. This is to avoid cycles. The limitation of this approach is that the entire graph of nodes may not have their contexts released. In follow-up PRs, we can implement better cycle detection to solve this problem. Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/) [ghstack-poisoned]
Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 91959579 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 91968765 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
// if it exists since it may have been deleted by an in-flight RPC.
if (DistAutogradContainer::getInstance().hasContextWithId(
        cleanupContextId)) {
  DistAutogradContainer::getInstance().releaseContext(
Actually, we probably need to do this check + release atomically, since it is still possible that another thread could release the context after we've done the hasContextWithId check. Then, releaseContext would throw but we don't necessarily want to throw, since the context has been freed as desired. Should we make a version of releaseContext that does not throw if the context we are releasing does not exist (or control this with a param?)
This is not actually an issue in the current implementation, since right now only the node that created the autograd context sends these RPCs. Each node will receive this RPC only once, so there aren't any races possible.
Yes, let's do check + release atomically and have a version which doesn't throw if the context is cleaned up (we can call it releaseContextIfPresent).
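To make the intent concrete, here is a rough sketch of the check-and-release-under-one-lock idea, written as a Python stand-in; the class, lock, and method names are illustrative only and are not the actual C++ `DistAutogradContainer` API:

```python
import threading

class ContainerSketch:
    """Illustrative stand-in for the C++ container, not the real API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._contexts = {}  # context_id -> context object

    def release_context(self, context_id):
        # Mirrors releaseContext: raises if the context is missing.
        with self._lock:
            if context_id not in self._contexts:
                raise RuntimeError("unknown context id {}".format(context_id))
            del self._contexts[context_id]

    def release_context_if_present(self, context_id):
        # The check and the erase happen under the same lock, so another
        # thread cannot delete the context between the check and the
        # release; a context that is already gone makes this a no-op.
        with self._lock:
            self._contexts.pop(context_id, None)
```

With the non-throwing variant, the RPC handler can call it unconditionally instead of doing a separate hasContextWithId check first.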
Also, shouldn't notifyWorkers be true here? In case we do nested RPCs, we need this node to notify other nodes too.
It should eventually be true, but since this PR doesn't implement cycle detection this is a way of not doing additional RPC calls. Once #27576 lands we can modify this to support nested rpc calls.
Do we need to do cycle detection? If we do an atomic check + release, we might ask a node to release again, but that would just be a noop right?
Yes, this is true. Though for performance reasons we'd probably want to avoid the extra RPC caused by asking a node to release again.
> Yes, this is true. Though for performance reasons we'd probably want to avoid the extra RPC caused by asking a node to release again.
True, although for now let's just go with a simple version and later on we can evaluate whether we need to optimize this further.
test/dist_autograd_test.py
Outdated
time.sleep(0.1)
pass
# Synchronize all workers.
rpc.sync_rpc()
Let's use dist.barrier() since sync_rpc might go away in a few future changes we have planned.
I don't think dist.barrier() works here (just tried it and the tests fail if we replace it). It appears that dist.barrier() blocks until all the processes enter the barrier. But rpc.sync_rpc() gives us the further guarantee that all outstanding RPCs will be completed, which is what we need here, since we need the RPCs for releasing the context to finish. With dist.barrier() there is no guarantee that all outstanding RPCs will be completed, it looks like. cc @mrshenli in case my understanding isn't correct.
Why is sync_rpc going away? Is there going to be any other way to enforce all processes finish all outstanding RPCs and then move forward (without joining)?
Hmm what is the reason dist.barrier doesn't work here? We are only doing sync RPCs in this test so all RPCs should have finished on all nodes when we are done with the barrier.
> Why is sync_rpc going away? Is there going to be any other way to enforce all processes finish all outstanding RPCs and then move forward (without joining)?
Because the user should wait on RPCs that they care about and not the framework. If you care about an RPC, you should use future.wait() to check its result.
But the RPC that is sent to release the context on other nodes isn't synchronous right? It is sent with agent->send() which enqueues the send, and the user doesn't wait on this (i.e. when the context manager is exited, the RPCs are sent, but we don't wait on these RPCs). If we want, we can wait on the future inside releaseContext, but that would be a slowdown to the user.
Since this RPC is not synchronous, there is no guarantee that it has finished even if dist.barrier() is called. We could use time.sleep(something_reasonable) to ensure the RPCs are processed, but that seems likely to be flaky.
In this particular case dist.barrier() works since the async RPCs haven't been sent, but in the other test where we sync after exiting the context manager, we would need a way to make sure the RPCs are sent + processed before continuing with the test.
> Because the user should wait on RPCs that they care about and not the framework.
Agreed. What about the case where our framework itself sends async RPCs as part of handling a user request? This is basically what these cleanup context RPCs are. The user does not have any handle to the futures created by them, but we should at some point guarantee that they are done.
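To illustrate the distinction being discussed: an RPC issued by the user has a future that can be waited on, while the cleanup message is enqueued internally with agent->send() and exposes no future to the test. The snippet below is a sketch using public `torch.distributed.rpc` calls; it assumes `rpc.init_rpc` has already been called on two workers named "worker0" and "worker1" (placeholder names):

```python
import torch
import torch.distributed.rpc as rpc

# A user-owned RPC: we hold its future and can wait on it explicitly.
fut = rpc.rpc_async("worker1", torch.add,
                    args=(torch.ones(2, 2), torch.ones(2, 2)))
fut.wait()

# The context-cleanup RPC, by contrast, is sent internally when the
# dist_autograd context manager exits; no future is surfaced to the user,
# so dist.barrier() alone cannot guarantee the remote side has processed it.
```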
test/dist_autograd_test.py
Outdated
ret = rpc.rpc_sync("worker{}".format(dst_rank), torch.add, args=(t1, t2))
rpc.rpc_sync("worker{}".format(dst_rank), store_context_id, args=(context_id,))
# now let each worker finish with their cleanup RPCs
rpc.sync_rpc()
dist.barrier().
See above conversation regarding this
One way to fix this would be to have a loop with a timeout which checks for the context and sleeps for a while if the context still exists. If we exit this loop and the timeout has expired, we fail the test. The timeout could be 10s to be conservative. An example:
start = time.time()
success = False
while time.time() - start < 10:
    if not dist_autograd.has_context_id(context_id):
        success = True
        break
    time.sleep(0.1)
self.assertTrue(success)
This structure can also be moved into a helper function that we can reuse in other tests.
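A possible shape for that helper, reusing the hypothetical `dist_autograd.has_context_id` check from the snippet above (the helper and check names are illustrative, not a committed API):

```python
import time

import torch.distributed.autograd as dist_autograd


def wait_for_context_cleanup(context_id, timeout_seconds=10):
    # Poll until the context disappears on this worker or the timeout expires.
    start = time.time()
    while time.time() - start < timeout_seconds:
        if not dist_autograd.has_context_id(context_id):
            return True
        time.sleep(0.1)
    return False
```

A test would then assert `self.assertTrue(wait_for_context_cleanup(context_id))` after exiting the context manager.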
@@ -1,5 +1,6 @@
#include <torch/csrc/distributed/autograd/context/dist_autograd_container.h>
#include <c10/util/Exception.h>
#include <torch/csrc/distributed/autograd/rpc_messages/cleanup_autograd_context_req.h>
nit: Try to order includes alphabetically. I thought clang-tidy would catch this? Maybe it doesn't?
This is strange: when I run the auto linter it gives me this ordering, and if I change the include order there is a linter error.
There must be a newline between the top include (for the header corresponding to the source file) and the other includes.
I'll clean this up (and the other files that are missing this) as part of a later diff. Should we add a lint rule for this?
It's a clang-format rule, not a clang-tidy rule. We don't lint on clang-format yet (see #25700).
Resolved review threads (outdated) on:
torch/csrc/distributed/autograd/context/dist_autograd_container.cpp
torch/csrc/distributed/autograd/rpc_messages/cleanup_autograd_context_req.cpp
torch/csrc/distributed/autograd/rpc_messages/cleanup_autograd_context_resp.h
Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 92161468 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 92171954 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
…one. Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 92207515 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
# after dist autograd context is cleaned up, it should be cleaned up on other
# nodes. This helper allows timeout_seconds for those RPCs to be completed, and
# ensures that all the contexts have been cleaned up in that timeframe.
def _all_contexts_cleaned_up(num_contexts, timeout_seconds=10):
10s should be sufficient.
// Clean up resources for a given context_id once the autograd pass is done.
// Sends RPC to other workers this worker knows about, telling them to clean
// up their context as well.
Let's mention that this throws an exception if the context_id is not found.
// context. Does nothing if it is not present.
void releaseContextIfPresent(int64_t context_id);

// Retrieve the autograd context for a given context_id.
Two separate methods seem a bit cleaner to me than having a flag.
…one. Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 92224583 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
Waiting on #28312 to land before landing this, since that is more urgent and it would be delayed by required rebase/test fixes if this went in first.
self._verify_graph_for_rpc_call_exec(list(send_functions.values())[0])
# this barrier is needed so one worker does not clean up their
# autograd context before another worker tries to access it.
dist.barrier()
@zhaojuanmao with these changes, I had to add dist.barrier()s here and in other tests so there are no races for the autograd context.
… when it is released on one node" Per #25525, we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in #26324) 4) Relevant unit tests In follow up PRs, we will add error checking + retries for this call. Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/) [ghstack-poisoned]
…one. Pull Request resolved: #27951 we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in #26324) 4) Relevant unit tests In follow up PRs, we will add error checking + retries for this call. ghstack-source-id: 92259078 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
…one. Pull Request resolved: #27951 we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in #26324) 4) Relevant unit tests In follow up PRs, we will add error checking + retries for this call. ghstack-source-id: 92261890 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
…one. Pull Request resolved: #27951 we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in #26324) 4) Relevant unit tests In follow up PRs, we will add error checking + retries for this call. ghstack-source-id: 92269279 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
This pull request has been merged in d9b4788.
…ne node (pytorch#27951) Summary: Pull Request resolved: pytorch#27951 we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in pytorch#26324) 4) Relevant unit tests In follow up PRs, we will add error checking + retries for this call. ghstack-source-id: 92269279 Test Plan: Added/modified unit tests in `test/dist_autograd_test.py` Differential Revision: D17920137 fbshipit-source-id: 7403512ab5fcbc28d21c548b2e45319dd472e26a
Stack from ghstack:
Per #25525, we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`).

This PR does a few things to implement the above:

1) Add classes to encapsulate messages for requesting this context release and the response
2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it.
3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in [distributed] add known worker ids to distributed autograd context #26324)
4) Relevant unit tests

In follow up PRs, we will add error checking + retries for this call.

Differential Revision: D17920137
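For reference, a minimal sketch of the behavior this PR adds, as seen from the Python API; the worker names, ranks, and world size are placeholders, and the script is assumed to be launched once per worker (only rank 0 is shown):

```python
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

rpc.init_rpc("worker0", rank=0, world_size=2)

t1 = torch.ones(2, 2, requires_grad=True)
t2 = torch.ones(2, 2, requires_grad=True)

with dist_autograd.context() as context_id:
    # The remote worker records this RPC under the same context_id.
    ret = rpc.rpc_sync("worker1", torch.add, args=(t1, t2))

# On exit, this worker releases its own context and, with this PR, sends
# cleanup RPCs (fire-and-forget, no acks or retries yet) to the workers it
# knows about, asking them to release context_id as well.

rpc.shutdown()
```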