Skip to content

Conversation

@rohan-varma
Copy link
Contributor

@rohan-varma rohan-varma commented Dec 7, 2019

Stack from ghstack:

When tensors don't require grad, we don't call addSendRpcBackward, where we record known workerIDs to clean up the dist autograd context later. But since #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the addKnownWorkerIds call to the getMessageWithAutograd function.

Differential Revision: D18869191

… are

recorded in all cases

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since D18496709, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)

[ghstack-poisoned]
…ensure they are

recorded in all cases"


When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Dec 7, 2019
… are

recorded in all cases

Pull Request resolved: #30914

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.
ghstack-source-id: 95144641

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)
…ensure they are

recorded in all cases"

recorded in all cases**

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)

[ghstack-poisoned]
…ensure they are

recorded in all cases"

recorded in all cases**
recorded in all cases**

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Dec 7, 2019
… are

recorded in all cases

Pull Request resolved: #30914

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.
ghstack-source-id: 95158324

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)
…ensure they are

recorded in all cases"

recorded in all cases**

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Dec 8, 2019
… are

recorded in all cases

Pull Request resolved: #30914

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.
ghstack-source-id: 95166853

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)
…ensure they are

recorded in all cases"

recorded in all cases**
recorded in all cases**

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Dec 9, 2019
… are

recorded in all cases

Pull Request resolved: #30914

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.
ghstack-source-id: 95178561

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 4f342a6.

xush6528 added a commit that referenced this pull request Dec 12, 2019
Cleanup after #30914.

In #30914, `autogradContext->addKnownWorkerId(dst);` was moved out of `addSendRpcBackward()`.

So `addSendRpcBackward()` does not need `dstId` as it's argument anymore.

Differential Revision: [D5742365](https://our.internmc.facebook.com/intern/diff/D5742365/)

[ghstack-poisoned]
xush6528 added a commit that referenced this pull request Dec 12, 2019
Cleanup after #30914.

In #30914, `autogradContext->addKnownWorkerId(dst);` was moved out of `addSendRpcBackward()`.

So `addSendRpcBackward()` does not need `dstId` as it's argument anymore.

Differential Revision: [D5742365](https://our.internmc.facebook.com/intern/diff/D5742365/)

[ghstack-poisoned]
xush6528 added a commit that referenced this pull request Dec 12, 2019
Pull Request resolved: #31207

Cleanup after #30914.

In #30914, `autogradContext->addKnownWorkerId(dst);` was moved out of `addSendRpcBackward()`.

So `addSendRpcBackward()` does not need `dstId` as it's argument anymore.
ghstack-source-id: 95509218

Differential Revision: [D5742365](https://our.internmc.facebook.com/intern/diff/D5742365/)
@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/48/head branch December 13, 2019 15:17
facebook-github-bot pushed a commit that referenced this pull request Dec 15, 2019
Summary:
Pull Request resolved: #31207

Cleanup after #30914.

In #30914, `autogradContext->addKnownWorkerId(dst);` was moved out of `addSendRpcBackward()`.

So `addSendRpcBackward()` does not need `dstId` as it's argument anymore.
ghstack-source-id: 95509218

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork -- test_context_cleanup_tensor_no_grad
```

Differential Revision: D5742365

fbshipit-source-id: accd041a594ec18d369231f5590289828d87baa7
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
…ytorch#30914)

Summary:
Pull Request resolved: pytorch#30914

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  pytorch#29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.
ghstack-source-id: 95178561

Test Plan: Added a unit test: `test_context_cleanup_tensor_no_grad`

Differential Revision: D18869191

fbshipit-source-id: b80f66bfd0dd7d01960abe1691d3f44095bb1b2b
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
Summary:
Pull Request resolved: pytorch#31207

Cleanup after pytorch#30914.

In pytorch#30914, `autogradContext->addKnownWorkerId(dst);` was moved out of `addSendRpcBackward()`.

So `addSendRpcBackward()` does not need `dstId` as it's argument anymore.
ghstack-source-id: 95509218

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork -- test_context_cleanup_tensor_no_grad
```

Differential Revision: D5742365

fbshipit-source-id: accd041a594ec18d369231f5590289828d87baa7
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants