Conversation

@rohan-varma (Contributor) commented Sep 16, 2019

Per #25525, we want to clean up the distributed autograd context on all nodes, in addition to the local one. To do this, we want to send async RPCs to the other nodes telling them to clean up the context.

The first step is for a node's context to know about the other workers. This PR does two things (a rough sketch follows below):

  1. Adds the necessary data structures and getter functions to `DistAutogradContext`
  2. Refactors calls to `addSendRpcBackward` to take in the `worker_id` as an additional argument
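
To make the above concrete, here is a rough Python sketch of what the context tracks after this change. The real implementation is C++; the names `_known_worker_ids`, `add_known_worker_id`, and `add_send_rpc_backward` below are illustrative assumptions, not the actual API.

```
# Conceptual sketch only; the real DistAutogradContext is a C++ class.
# All names below are illustrative assumptions.
class DistAutogradContextSketch:
    def __init__(self, context_id):
        self.context_id = context_id
        self._known_worker_ids = set()  # workers this context has sent RPCs to

    def add_known_worker_id(self, worker_id):
        self._known_worker_ids.add(worker_id)

    def get_known_worker_ids(self):
        # Getter used later to fan out "clean up this context" RPCs.
        return set(self._known_worker_ids)

def add_send_rpc_backward(context, tensors, dst_worker_id):
    # After this PR, the call also records which worker the RPC was sent to.
    context.add_known_worker_id(dst_worker_id)
    # ... attach the send autograd function to `tensors` as before ...
```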

pritam and others added 23 commits August 19, 2019 16:32
As per pytorch#23110, each autograd pass
would be assigned a unique autograd_context_id. In this change we introduce a
DistAutogradContainer per worker which holds information for each autograd pass
currently running.

DistAutogradContainer has a map from the autograd_context_id to
DistAutogradContext (which holds all the relevant information for the autograd
pass). DistAutogradContext currently only stores the autograd_context_id;
more information will be added to it later as we build out the rest of the
framework.

The autograd_context_id is a globally unique 64-bit integer where the first 16
bits are the worker_id and the next 48 bits auto-increment for uniqueness.

Sample python code on how this would be used for distributed autograd:

```
import torch.distributed.autograd as dist_autograd
worker_id = 0
dist_autograd.init(worker_id)
with dist_autograd.context() as context_id:
     # forward pass...
     # backward pass...
     # optimizer step...
```

Differential Revision: [D16356694](https://our.internmc.facebook.com/intern/diff/D16356694/)
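
As a worked example of the id layout described in the commit message above (assuming "first 16 bits" means the most significant bits of the 64-bit value), the composition can be sketched as follows; this is illustration only, not the C++ code:

```
# Sketch of the autograd_context_id layout: high 16 bits = worker_id,
# low 48 bits = an auto-incrementing counter.
def make_context_id(worker_id, counter):
    assert 0 <= worker_id < (1 << 16)
    assert 0 <= counter < (1 << 48)
    return (worker_id << 48) | counter

def split_context_id(context_id):
    return context_id >> 48, context_id & ((1 << 48) - 1)

cid = make_context_id(worker_id=3, counter=7)
assert split_context_id(cid) == (3, 7)
```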
This contains very basic functionality for adding a 'send' autograd
function to our autograd graph. The purpose of this change is to validate that the
basic structure proposed here makes sense; once it does, we can build
upon it to address more complicated scenarios. At a high level we've added
the following functionality (sketched below):

1) Define a very simple 'SendRpcBackward' autograd function.
2) Attach this function to the appropriate tensors when we call an RPC.
3) Store the send function in our distributed autograd context.

GitHub Issue: pytorch#23110
Differential Revision: [D16903255](https://our.internmc.facebook.com/intern/diff/D16903255/)
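
A minimal Python model of steps 1-3 above; the real implementation is C++ (there, SendRpcBackward is an autograd Node), and every name in this sketch is a stand-in:

```
# Toy model of the send-side bookkeeping; not the real API.
class SendRpcBackward:
    """Stands in for the autograd function placed at the RPC send boundary."""
    def __init__(self, tensors):
        self.tensors = list(tensors)  # step 2: attached to the outgoing tensors
        self.grads = None             # later filled with gradients received over RPC

class ContextSketch:
    def __init__(self):
        self.send_functions = []

    def add_send_function(self, fn):
        self.send_functions.append(fn)  # step 3: remembered by the context

def on_outgoing_rpc(context, tensors):
    send_fn = SendRpcBackward(tensors)  # step 1: create the send function
    context.add_send_function(send_fn)
    return send_fn
```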
Master GH issue: pytorch#23110.

This change builds upon pytorch#24876 and provides all the autograd hooks
needed for a forward pass with distributed RPC for builtin operators. It does
not address distributed RPC for Python UDFs; that will be addressed in
follow-up PRs.

Summary of changes (sketched below):
1. Attach send autograd functions when a request is sent from the client and
response is sent from the server.
2. Attach receive autograd functions when a request is received on the server
and a response is received on the client.
3. Generate a globally unique autograd_message_id for each send/recv autograd
function pair to uniquely identify them.

Differential Revision: [D17148077](https://our.internmc.facebook.com/intern/diff/D17148077/)
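
The pairing described in point 3 can be pictured with a toy model: each send function and the matching recv function on the peer share one autograd_message_id, so the pair can be identified across workers (for example, to route gradients during the backward pass). The dictionaries, string placeholders, and id scheme below are assumptions for illustration, not the actual implementation.

```
# Toy model of send/recv pairing by autograd_message_id (illustrative only).
client_send_functions = {}  # autograd_message_id -> send function (client side)
server_recv_functions = {}  # autograd_message_id -> recv function (server side)

def next_message_id(worker_id, counter):
    # One plausible way to make the id globally unique; the real scheme may differ.
    return (worker_id << 48) | counter

def forward_request(message_id):
    client_send_functions[message_id] = "SendRpcBackward"  # attached when the request is sent
    server_recv_functions[message_id] = "RecvRpcBackward"  # attached when the request is received

msg_id = next_message_id(worker_id=0, counter=1)
forward_request(msg_id)
assert set(client_send_functions) == set(server_recv_functions) == {msg_id}
```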
@pytorchbot added the oncall: distributed label on Sep 16, 2019
@facebook-github-bot

@rohan-varma merged this pull request in b5e0fd4.

rohan-varma added a commit that referenced this pull request Oct 15, 2019
… when it is released on one node"


Per #25525, we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`).

This PR does a few things to implement the above (sketched below):
1) Add classes to encapsulate messages for requesting this context release and the response
2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it.
3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in #26324)
4) Relevant unit tests

Note: the current version is very simple and does not attempt to do cycle detection of RPCs or any retries. If `releaseContext` is called directly on one node, then that node will send RPCs to the other nodes it knows about to release their context. However, if a node receives an RPC that tells it to release its context, it will do so, but will not forward this request to other nodes that it knows about. This is to avoid cycles. The limitation of this approach is that the entire graph of nodes may not have their contexts released. In follow-up PRs, we can implement better cycle detection to solve this problem.

Differential Revision: [D17920137](https://our.internmc.facebook.com/intern/diff/D17920137/)

[ghstack-poisoned]
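
A conceptual Python sketch of the release flow described in the commit above; the real code is C++ (`DistAutogradContainer::releaseContext`), the RPC transport here is faked, and every name is illustrative. As in the PR, there is no ack, retry, or forwarding.

```
class FakeContext:
    def __init__(self, context_id, known_worker_ids=()):
        self.context_id = context_id
        self.known_worker_ids = set(known_worker_ids)

class FakeContainer:
    def __init__(self):
        self.contexts = {}  # context_id -> FakeContext

def release_context(container, context_id, cluster):
    # Called on the node that exits the context manager: release locally,
    # then fire-and-forget a cleanup request to every known worker.
    ctx = container.contexts.pop(context_id)
    for worker_id in ctx.known_worker_ids:
        on_cleanup_request(cluster[worker_id], context_id)  # stands in for an async RPC

def on_cleanup_request(container, context_id):
    # Receivers release their own copy but do NOT forward the request,
    # which avoids RPC cycles (at the cost of possibly missing some nodes).
    container.contexts.pop(context_id, None)

# Tiny usage example: worker 0 knows about workers 1 and 2.
cluster = {i: FakeContainer() for i in range(3)}
for i, c in cluster.items():
    c.contexts[42] = FakeContext(42, known_worker_ids={1, 2} if i == 0 else ())
release_context(cluster[0], 42, cluster)
assert all(42 not in c.contexts for c in cluster.values())
```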
facebook-github-bot pushed a commit that referenced this pull request Oct 21, 2019
…ne node (#27951)

Summary:
Pull Request resolved: #27951

we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`).

This PR does a few things to implement the above:
1) Add classes to encapsulate messages for requesting this context release and the response
2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it.
3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in #26324)
4) Relevant unit tests

In follow up PRs, we will add error checking + retries for this call.

ghstack-source-id: 92269279

Test Plan: Added/modified unit tests in `test/dist_autograd_test.py`

Differential Revision: D17920137

fbshipit-source-id: 7403512ab5fcbc28d21c548b2e45319dd472e26a
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Summary:
Per pytorch#25525 we want to clean up distributed autograd context on all nodes, in addition to the local one. To do this, we want to send async RPCs to the other nodes telling them to clean up the context.

The first step for this is for a node's context to know about the other workers. This PR does two things:

1) Adds the necessary data structures and getter functions to `DistAutogradContext`
2) Refactors calls to `addSendRpcBackward` to take in the `worker_id` as an additional argument
Pull Request resolved: pytorch#26324

Differential Revision: D17769411

Pulled By: rohan-varma

fbshipit-source-id: b7327d1209a574e2e88cb197edff3103024d51ad
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
…ne node (pytorch#27951)

Summary:
Pull Request resolved: pytorch#27951

we want to clean up the distributed autograd context across the other nodes when a single node is done (here done means exited the context manager `with dist_autograd.context() as context_id: ...`).

This PR does a few things to implement the above:
1) Add classes to encapsulate messages for requesting this context release and the response
2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it.
3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in pytorch#26324)
4) Relevant unit tests

In follow up PRs, we will add error checking + retries for this call.

ghstack-source-id: 92269279

Test Plan: Added/modified unit tests in `test/dist_autograd_test.py`

Differential Revision: D17920137

fbshipit-source-id: 7403512ab5fcbc28d21c548b2e45319dd472e26a
Labels

Merged, module: cpp, module: pybind, module: rpc, oncall: distributed
