[distributed] add known worker ids to distributed autograd context #26324
As per pytorch#23110, each autograd pass would be assigned a unique autograd_context_id. In this change we introduce a DistAutogradContainer per worker which holds information for each autograd pass currently running. DistAutogradContainer has a map from the autograd_context_id to DistAutogradContext (which holds all the relevant information for the autograd pass). DistAutogradContext currently only stores the autograd_context_id; more information would be added to it later as we build out the rest of the framework.

The autograd_context_id is a 64-bit globally unique integer where the first 16 bits are the worker_id and the next 48 bits are auto-incrementing for uniqueness.

Sample Python code on how this would be used for distributed autograd:

```
import torch.distributed.autograd as dist_autograd

worker_id = 0
dist_autograd.init(worker_id)
with dist_autograd.context() as context_id:
    # forward pass...
    # backward pass...
    # optimizer step...
```

Differential Revision: [D16356694](https://our.internmc.facebook.com/intern/diff/D16356694/)
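As a rough illustration of the id layout described above (a sketch only; the real generator lives in the C++ DistAutogradContainer, and the class and method names below are hypothetical), a 64-bit context id with a 16-bit worker_id prefix and a 48-bit auto-incrementing counter could be composed like this:

```
# Illustrative sketch of the 64-bit autograd_context_id layout:
# top 16 bits hold the worker_id, lower 48 bits auto-increment.

WORKER_ID_BITS = 16
COUNTER_BITS = 48
MAX_COUNTER = (1 << COUNTER_BITS) - 1


class ContextIdGenerator:
    def __init__(self, worker_id: int):
        assert 0 <= worker_id < (1 << WORKER_ID_BITS)
        self._prefix = worker_id << COUNTER_BITS
        self._next = 0

    def next_id(self) -> int:
        assert self._next <= MAX_COUNTER, "counter exhausted"
        context_id = self._prefix | self._next
        self._next += 1
        return context_id


gen = ContextIdGenerator(worker_id=3)
cid = gen.next_id()
assert cid >> COUNTER_BITS == 3   # worker_id sits in the top 16 bits
assert cid & MAX_COUNTER == 0     # first id generated on this worker
```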
This contains very basic functionality for adding a 'send' autograd function to our autograd graph. The purpose of this change is to validate that the basic structure proposed here makes sense. Once this makes sense, we can build upon it to address more complicated scenarios. At a high level we've added the following functionality:

1) Define a very simple 'SendRpcBackwards' autograd function.
2) Attach this function to appropriate tensors when we call an RPC.
3) Store the send function in our distributed autograd context.

Differential Revision: [D16903255](https://our.internmc.facebook.com/intern/diff/D16903255/)
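To make the three steps above concrete, here is a minimal conceptual sketch in Python (the actual implementation is in C++; the class and helper names below are hypothetical, not the real API): an autograd node is attached to a tensor that is about to leave the worker via RPC, and that node is remembered in the distributed autograd context.

```
import torch

# Conceptual sketch of the 'send' autograd function idea, not the real C++ code.

class SendRpcBackwardSketch(torch.autograd.Function):
    @staticmethod
    def forward(ctx, tensor):
        # Identity-like forward; the tensor is what would be sent over RPC.
        return tensor.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # In distributed autograd the gradient would arrive via RPC; locally
        # this is a pass-through so the chain continues into the local graph.
        return grad_output


def attach_send_function(context_state, tensor):
    # Hypothetical helper mirroring steps 2) and 3) above.
    out = SendRpcBackwardSketch.apply(tensor)
    context_state.setdefault("send_functions", []).append(out.grad_fn)
    return out


x = torch.randn(3, requires_grad=True)
ctx_state = {}
y = attach_send_function(ctx_state, x)  # y is what would be shipped over RPC
```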
Master GH issue: pytorch#23110. This change builds upon pytorch#24876 and provides all the autograd hooks needed for a forward pass with distributed rpc for builtin operators. This change does not address distributed rpc for python UDFs; that will be addressed in follow-up PRs. Summary of changes:

1. Attach send autograd functions when a request is sent from the client and a response is sent from the server.
2. Attach receive autograd functions when a request is received on the server and a response is received on the client.
3. Generate a globally unique autograd_message_id for each send/recv autograd function pair to uniquely identify them.

Differential Revision: [D17148077](https://our.internmc.facebook.com/intern/diff/D17148077/)
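A simplified, self-contained sketch of these hooks (hypothetical Python names; not PyTorch's real API, which is implemented in C++): each side of one RPC records a send or recv function keyed by a globally unique autograd_message_id, so the two halves of the graph can be stitched together in the backward pass.

```
import itertools

# Sketch only: how send/recv hooks could be recorded per autograd_message_id.

class SketchContext:
    def __init__(self, worker_id: int):
        self.worker_id = worker_id
        self._msg_counter = itertools.count()
        self.send_functions = {}  # autograd_message_id -> placeholder send node
        self.recv_functions = {}  # autograd_message_id -> placeholder recv node

    def new_message_id(self) -> int:
        # Same idea as the context id: worker_id in the high bits keeps ids
        # from different workers distinct.
        return (self.worker_id << 48) | next(self._msg_counter)


def client_send(ctx: SketchContext, payload):
    msg_id = ctx.new_message_id()
    ctx.send_functions[msg_id] = ("SendRpcBackward", payload)  # hook on outgoing request
    return msg_id, payload                                      # shipped to the server


def server_receive(ctx: SketchContext, msg_id, payload):
    ctx.recv_functions[msg_id] = ("RecvRpcBackward", payload)   # hook on incoming request
    # ... run the builtin op, then attach a send function to the response ...


client = SketchContext(worker_id=0)
server = SketchContext(worker_id=1)
mid, data = client_send(client, payload=[1.0, 2.0])
server_receive(server, mid, data)
```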
@rohan-varma merged this pull request in b5e0fd4.
rohan-varma added a commit that referenced this pull request on Oct 15, 2019
… when it is released on one node"

Per #25525, we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above:

1) Add classes to encapsulate messages for requesting this context release and the response
2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it.
3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in #26324).
4) Relevant unit tests

Note: the current version is very simple and does not attempt to do cycle detection of RPCs or any retries. If `releaseContext` is called directly on one node, then that node will send RPCs to the other nodes it knows about to release their context. However, if a node receives an RPC that tells it to release its context, it will do so, but will not forward this request to other nodes that it knows about. This is to avoid cycles. The limitation of this approach is that the entire graph of nodes may not have their contexts released. In follow-up PRs, we can implement better cycle detection to solve this problem.

Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)

[ghstack-poisoned]
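A minimal sketch of the release flow described above (hypothetical Python names, not the real `DistAutogradContainer` C++ code): the node that exits the context manager fires best-effort async RPCs to every worker id recorded in the context, and a node that merely receives such a request releases locally without forwarding, which is what avoids cycles.

```
# Sketch only: fan-out release of a distributed autograd context.

class ContainerSketch:
    def __init__(self, rpc_send_async):
        self.contexts = {}                 # context_id -> set of known worker ids
        self.rpc_send_async = rpc_send_async

    def release_context(self, context_id):
        known_workers = self.contexts.pop(context_id, set())
        for worker_id in known_workers:
            # Fire-and-forget: no ack is awaited and no retries are attempted yet.
            self.rpc_send_async(worker_id, ("CLEANUP_AUTOGRAD_CONTEXT", context_id))

    def handle_cleanup_request(self, context_id):
        # Release locally only; do not propagate to other known workers.
        self.contexts.pop(context_id, None)


sent = []
container = ContainerSketch(rpc_send_async=lambda dst, msg: sent.append((dst, msg)))
container.contexts[42] = {1, 2}
container.release_context(42)   # fires one cleanup RPC per known worker
```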
facebook-github-bot pushed a commit that referenced this pull request on Oct 21, 2019
…ne node (#27951)

Summary: Pull Request resolved: #27951

We want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`). This PR does a few things to implement the above:

1) Add classes to encapsulate messages for requesting this context release and the response
2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it.
3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in #26324).
4) Relevant unit tests

In follow up PRs, we will add error checking + retries for this call.

ghstack-source-id: 92269279
Test Plan: Added/modified unit tests in `test/dist_autograd_test.py`
Differential Revision: D17920137
fbshipit-source-id: 7403512ab5fcbc28d21c548b2e45319dd472e26a
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request on Feb 4, 2020
Summary: Per pytorch#25525 we want to clean up distributed autograd context on all nodes, in addition to the local one. To do this, we want to send async RPCs to the other nodes telling them to clean up the context. The first step for this is for a node's context to know about the other workers. This PR does two things:

1) Adds the necessary data structures and getter functions to `DistAutogradContext`
2) Refactors calls to `addSendRpcBackward` to take in the `worker_id` as an additional argument

Pull Request resolved: pytorch#26324
Differential Revision: D17769411
Pulled By: rohan-varma
fbshipit-source-id: b7327d1209a574e2e88cb197edff3103024d51ad
Labels
Merged
module: cpp (Related to C++ API)
module: pybind (Related to our Python bindings / interactions with other Python libraries)
module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
Per #25525 we want to clean up distributed autograd context on all nodes, in addition to the local one. To do this, we want to send async RPCs to the other nodes telling them to clean up the context.
The first step for this is for a node's context to know about the other workers. This PR does two things:
1) Adds the necessary data structures and getter functions to `DistAutogradContext`
2) Refactors calls to `addSendRpcBackward` to take in the `worker_id` as an additional argument
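A small sketch of these two changes (hypothetical Python names; the real code is C++): the context keeps a set of worker ids it has communicated with, and the helper that attaches a send function records the destination worker id so the context can later be released on that worker as well.

```
# Sketch only: tracking known worker ids in the distributed autograd context.

class DistAutogradContextSketch:
    def __init__(self, context_id: int):
        self.context_id = context_id
        self._known_worker_ids = set()
        self._send_functions = []

    def add_known_worker_id(self, worker_id: int) -> None:
        self._known_worker_ids.add(worker_id)

    def get_known_worker_ids(self):
        return frozenset(self._known_worker_ids)


def add_send_rpc_backward(context: DistAutogradContextSketch, tensors, worker_id: int):
    # worker_id is the new argument: remember who we sent to so the context
    # can later be cleaned up on that worker too.
    context.add_known_worker_id(worker_id)
    context._send_functions.append(("SendRpcBackward", tensors))


ctx = DistAutogradContextSketch(context_id=0)
add_send_rpc_backward(ctx, tensors=[1.0, 2.0], worker_id=7)
assert 7 in ctx.get_known_worker_ids()
```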