[distributed] cleanup dist autograd context on other nodes when it is released on one node #27951
Conversation
Sends an rpc for release context Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/) [ghstack-poisoned]
…xt on other nodes" Per #25525, we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the current context and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. 4) Relevant unit tests Note: the current version is very simple and does not attempt to do cycle detection of RPCs or any retries. If `releaseContext` is called directly on one node, then that node will send RPCs to the other nodes it knows about to release their context. However, if a node receives an RPC that tells it to release its context, it will do so, but will not forward this request to other nodes that it knows about. This is to avoid cycles. The limitation of this approach is that the entire graph of nodes may not have their contexts released. In follow-up PRs, we can implement better cycle detection to solve this problem. Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/) [ghstack-poisoned]
Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 91906762 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 91918030 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
…nodes when it is released on one node" Per #25525, we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the current context and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. 4) Relevant unit tests Note: the current version is very simple and does not attempt to do cycle detection of RPCs or any retries. If `releaseContext` is called directly on one node, then that node will send RPCs to the other nodes it knows about to release their context. However, if a node receives an RPC that tells it to release its context, it will do so, but will not forward this request to other nodes that it knows about. This is to avoid cycles. The limitation of this approach is that the entire graph of nodes may not have their contexts released. In follow-up PRs, we can implement better cycle detection to solve this problem. Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/) [ghstack-poisoned]
Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 91918572 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
@pritamdamania87 I think this is ready for a first pass look now. Some legit failures (https://circleci.com/gh/pytorch/pytorch/3209870?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link) in the dist autograd tests, since the contexts are now being cleaned up across nodes, so the tests might need some more refactoring.
… when it is released on one node" Per #25525, we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in #26324) 4) Relevant unit tests Note: the current version is very simple and does not attempt to do cycle detection of RPCs or any retries. If `releaseContext` is called directly on one node, then that node will send RPCs to the other nodes it knows about to release their context. However, if a node receives an RPC that tells it to release its context, it will do so, but will not forward this request to other nodes that it knows about. This is to avoid cycles. The limitation of this approach is that the entire graph of nodes may not have their contexts released. In follow-up PRs, we can implement better cycle detection to solve this problem. Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/) [ghstack-poisoned]
Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 91959579 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 91968765 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
// if it exists since it may have been deleted by an in-flight RPC.
if (DistAutogradContainer::getInstance().hasContextWithId(
        cleanupContextId)) {
  DistAutogradContainer::getInstance().releaseContext(
Actually, we probably need to do this check + release atomically, since it is still possible that another thread could release the context after we've done the hasContextWithId check. Then, releaseContext would throw but we don't necessarily want to throw, since the context has been freed as desired. Should we make a version of releaseContext that does not throw if the context we are releasing does not exist (or control this with a param?)
This is not actually an issue in the current implementation, since right now only the node that created the autograd context sends these RPCs. Each node will receive this RPC only once, so there aren't any races possible.
Yes, let's do check + release atomically and have a version which doesn't throw if the context is cleaned up (we can call it releaseContextIfPresent).
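To make the intent concrete, here is a rough sketch of the check-and-release-under-one-lock idea, written as a Python stand-in; the class, lock, and method names are illustrative only and are not the actual C++ `DistAutogradContainer` API:

```python
import threading

class ContainerSketch:
    """Illustrative stand-in for the C++ container, not the real API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._contexts = {}  # context_id -> context object

    def release_context(self, context_id):
        # Mirrors releaseContext: raises if the context is missing.
        with self._lock:
            if context_id not in self._contexts:
                raise RuntimeError("unknown context id {}".format(context_id))
            del self._contexts[context_id]

    def release_context_if_present(self, context_id):
        # The check and the erase happen under the same lock, so another
        # thread cannot delete the context between the check and the
        # release; a context that is already gone makes this a no-op.
        with self._lock:
            self._contexts.pop(context_id, None)
```

With the non-throwing variant, the RPC handler can call it unconditionally instead of doing a separate hasContextWithId check first.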
Also, shouldn't notifyWorkers be true here? In case we do nested RPCs, we need this node to notify other nodes too.
It should eventually be true, but since this PR doesn't implement cycle detection this is a way of not doing additional RPC calls. Once #27576 lands we can modify this to support nested rpc calls.
Do we need to do cycle detection? If we do an atomic check + release, we might ask a node to release again, but that would just be a noop right?
Yes, this is true. Though for performance reasons we'd probably want to avoid the extra RPC caused by asking a node to release again.
> Yes, this is true. Though for performance reasons we'd probably want to avoid the extra RPC caused by asking a node to release again.
True, although for now let's just go with a simple version and later on we can evaluate whether we need to optimize this further.
test/dist_autograd_test.py
Outdated
time.sleep(0.1)
pass
# Synchronize all workers.
rpc.sync_rpc()
Let's use dist.barrier() since sync_rpc might go away in a few future changes we have planned.
I don't think dist.barrier() works here (just tried it and the tests fail if we replace it). It appears that dist.barrier() blocks until all the processes enter the barrier. But rpc.sync_rpc() gives us the further guarantee that all outstanding RPCs will be completed, which is what we need here, since we need the RPCs for releasing the context to finish. With dist.barrier() there is no guarantee that all outstanding RPCs will be completed, it looks like. cc @mrshenli in case my understanding isn't correct.
Why is sync_rpc going away? Is there going to be any other way to enforce all processes finish all outstanding RPCs and then move forward (without joining)?
Hmm what is the reason dist.barrier doesn't work here? We are only doing sync RPCs in this test so all RPCs should have finished on all nodes when we are done with the barrier.
> Why is sync_rpc going away? Is there going to be any other way to enforce all processes finish all outstanding RPCs and then move forward (without joining)?
Because the user should wait on RPCs that they care about and not the framework. If you care about an RPC, you should use future.wait() to check its result.
But the RPC that is sent to release the context on other nodes isn't synchronous right? It is sent with agent->send() which enqueues the send, and the user doesn't wait on this (i.e. when the context manager is exited, the RPCs are sent, but we don't wait on these RPCs). If we want, we can wait on the future inside releaseContext, but that would be a slowdown to the user.
Since this RPC is not synchronous, there is no guarantee that it has finished even if dist.barrier() is called. We could use time.sleep(something_reasonable) to ensure the RPCs are processed, but that seems likely to be flaky.
In this particular case dist.barrier() works since the async RPCs haven't been sent, but in the other test where we sync after exiting the context manager, we would need a way to make sure the RPCs are sent + processed before continuing with the test.
> Because the user should wait on RPCs that they care about and not the framework.
Agreed. What about the case where our framework itself sends async RPCs as part of handling a user request? This is basically what these cleanup context RPCs are. The user does not have any handle to the futures created by them, but we should at some point guarantee that they are done.
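To illustrate the distinction being discussed: an RPC issued by the user has a future that can be waited on, while the cleanup message is enqueued internally with agent->send() and exposes no future to the test. The snippet below is a sketch using public `torch.distributed.rpc` calls; it assumes `rpc.init_rpc` has already been called on two workers named "worker0" and "worker1" (placeholder names):

```python
import torch
import torch.distributed.rpc as rpc

# A user-owned RPC: we hold its future and can wait on it explicitly.
fut = rpc.rpc_async("worker1", torch.add,
                    args=(torch.ones(2, 2), torch.ones(2, 2)))
fut.wait()

# The context-cleanup RPC, by contrast, is sent internally when the
# dist_autograd context manager exits; no future is surfaced to the user,
# so dist.barrier() alone cannot guarantee the remote side has processed it.
```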
test/dist_autograd_test.py
Outdated
ret = rpc.rpc_sync("worker{}".format(dst_rank), torch.add, args=(t1, t2))
rpc.rpc_sync("worker{}".format(dst_rank), store_context_id, args=(context_id,))
# now let each worker finish with their cleanup RPCs
rpc.sync_rpc()
dist.barrier().
See above conversation regarding this
One way to fix this would be to have a loop with a timeout which checks for the context and sleeps for a while if the context still exists. If we exit this loop and the timeout has expired, we fail the test. The timeout could be 10s to be conservative. An example:
start = time.time()
success = False
while time.time() - start < 10:
    if not dist_autograd.has_context_id(context_id):
        success = True
        break
    time.sleep(0.1)
self.assertTrue(success)
This structure can also be moved into a helper function that we can reuse in other tests.
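A possible shape for that helper, reusing the hypothetical `dist_autograd.has_context_id` check from the snippet above (the helper and check names are illustrative, not a committed API):

```python
import time

import torch.distributed.autograd as dist_autograd


def wait_for_context_cleanup(context_id, timeout_seconds=10):
    # Poll until the context disappears on this worker or the timeout expires.
    start = time.time()
    while time.time() - start < timeout_seconds:
        if not dist_autograd.has_context_id(context_id):
            return True
        time.sleep(0.1)
    return False
```

A test would then assert `self.assertTrue(wait_for_context_cleanup(context_id))` after exiting the context manager.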
@@ -1,5 +1,6 @@
#include <torch/csrc/distributed/autograd/context/dist_autograd_container.h>
#include <c10/util/Exception.h>
#include <torch/csrc/distributed/autograd/rpc_messages/cleanup_autograd_context_req.h>
nit: Try to order includes alphabetically. I thought clang-tidy would catch this? Maybe it doesn't?
This is strange: when I run the auto linter it gives me this ordering, and if I change the include order there is a linter error.
There must be a newline between the top include (for the header corresponding to the source file) and the other includes.
I'll clean this up (and the other files that are missing this) as part of a later diff. Should we add a lint rule for this?
It's a clang-format rule, not a clang-tidy rule. We don't lint on clang-format yet (see #25700).
Resolved review threads (outdated) on:
torch/csrc/distributed/autograd/context/dist_autograd_container.cpp
torch/csrc/distributed/autograd/rpc_messages/cleanup_autograd_context_req.cpp
torch/csrc/distributed/autograd/rpc_messages/cleanup_autograd_context_resp.h
Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 92161468 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 92171954 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
…one. Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 92207515 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
# after dist autograd context is cleaned up, it should be cleaned up on other
# nodes. This helper allows timeout_seconds for those RPCs to be completed, and
# ensures that all the contexts have been cleaned up in that timeframe.
def _all_contexts_cleaned_up(num_contexts, timeout_seconds=10):
10s should be sufficient.
// Clean up resources for a given context_id once the autograd pass is done.
// Sends RPC to other workers this worker knows about, telling them to clean
// up their context as well.
Let's mention that this throws an exception if the context_id is not found.
// context. Does nothing if it is not present.
void releaseContextIfPresent(int64_t context_id);

// Retrieve the autograd context for a given context_id.
Two separate methods seem a bit cleaner to me than having a flag.
…one. Pull Request resolved: #27951 Sends an rpc for release context ghstack-source-id: 92224583 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
Waiting on #28312 to land before landing this, since that is more urgent and it would be delayed by required rebase/test fixes if this went in first.
self._verify_graph_for_rpc_call_exec(list(send_functions.values())[0])
# this barrier is needed so one worker does not clean up their
# autograd context before another worker tries to access it.
dist.barrier()
@zhaojuanmao with these changes, I had to add dist.barrier()s here and in other tests so there are no races for the autograd context.
… when it is released on one node" Per #25525, we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in #26324) 4) Relevant unit tests In follow up PRs, we will add error checking + retries for this call. Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/) [ghstack-poisoned]
…one. Pull Request resolved: #27951 we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in #26324) 4) Relevant unit tests In follow up PRs, we will add error checking + retries for this call. ghstack-source-id: 92259078 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
…one. Pull Request resolved: #27951 we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in #26324) 4) Relevant unit tests In follow up PRs, we will add error checking + retries for this call. ghstack-source-id: 92261890 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
…one. Pull Request resolved: #27951 we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in #26324) 4) Relevant unit tests In follow up PRs, we will add error checking + retries for this call. ghstack-source-id: 92269279 Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)
This pull request has been merged in d9b4788.
…ne node (pytorch#27951) Summary: Pull Request resolved: pytorch#27951 we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above: 1) Add classes to encapsulate messages for requesting this context release and the response 2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it. 3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in pytorch#26324) 4) Relevant unit tests In follow up PRs, we will add error checking + retries for this call. ghstack-source-id: 92269279 Test Plan: Added/modified unit tests in `test/dist_autograd_test.py` Differential Revision: D17920137 fbshipit-source-id: 7403512ab5fcbc28d21c548b2e45319dd472e26a
Stack from ghstack:
Per #25525, we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`).

This PR does a few things to implement the above:

1) Add classes to encapsulate messages for requesting this context release and the response
2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it.
3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in [distributed] add known worker ids to distributed autograd context #26324)
4) Relevant unit tests

In follow up PRs, we will add error checking + retries for this call.

Differential Revision: D17920137
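For reference, a minimal sketch of the behavior this PR adds, as seen from the Python API; the worker names, ranks, and world size are placeholders, and the script is assumed to be launched once per worker (only rank 0 is shown):

```python
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

rpc.init_rpc("worker0", rank=0, world_size=2)

t1 = torch.ones(2, 2, requires_grad=True)
t2 = torch.ones(2, 2, requires_grad=True)

with dist_autograd.context() as context_id:
    # The remote worker records this RPC under the same context_id.
    ret = rpc.rpc_sync("worker1", torch.add, args=(t1, t2))

# On exit, this worker releases its own context and, with this PR, sends
# cleanup RPCs (fire-and-forget, no acks or retries yet) to the workers it
# knows about, asking them to release context_id as well.

rpc.shutdown()
```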