Skip to content

Conversation

@mrshenli
Copy link
Contributor

@mrshenli mrshenli commented Nov 14, 2019

Stack from ghstack:

Even though the request might not contain any requires_grad tensor,
the return value could. Therefore, we should always include the
autograd context id in the request.

closes #28819

Differential Revision: D18496709

Even though the request might not contain any requires_grad tensor,
the return value could. Therefore, we should always include the
autograd context id in the request.

[ghstack-poisoned]
Even though the request might not contain any requires_grad tensor,
the return value could. Therefore, we should always include the
autograd context id in the request.

closes #28819 

[ghstack-poisoned]
Even though the request might not contain any requires_grad tensor,
the return value could. Therefore, we should always include the
autograd context id in the request.

closes #28819

Differential Revision: [D18496709](https://our.internmc.facebook.com/intern/diff/D18496709)

[ghstack-poisoned]
Comment on lines 619 to 623
ctx = dist_autograd._current_context()
send_functions = ctx._send_functions()
self.assertEqual(len(send_functions), 0)
recv_functions = ctx._recv_functions()
self.assertEqual(len(recv_functions), 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of checking the autograd functions, can we run the backward pass and verify the gradient is accumulated correctly on the remote tensor? We probably might need to return a global tensor from ret_requires_grad to validate the grads are accumulated for ExecMode.RPC_SYNC

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let me add that.

Copy link
Contributor

@pritamdamania87 pritamdamania87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this! Just have one comment inline for the unit test.

Even though the request might not contain any requires_grad tensor,
the return value could. Therefore, we should always include the
autograd context id in the request.

closes #28819

Differential Revision: [D18496709](https://our.internmc.facebook.com/intern/diff/D18496709)

[ghstack-poisoned]
Even though the request might not contain any requires_grad tensor,
the return value could. Therefore, we should always include the
autograd context id in the request.

closes #28819

Differential Revision: [D18496709](https://our.internmc.facebook.com/intern/diff/D18496709)

[ghstack-poisoned]
Even though the request might not contain any requires_grad tensor,
the return value could. Therefore, we should always include the
autograd context id in the request.

closes #28819

Differential Revision: [D18496709](https://our.internmc.facebook.com/intern/diff/D18496709)

[ghstack-poisoned]
@mrshenli
Copy link
Contributor Author

The lint failure is on a file that I didn't touch:

/home/vsts/work/1/s/aten/src/ATen/cuda/ATenCUDAGeneral.h:3:10: error: 'cuda.h' file not found [clang-diagnostic-error]

@facebook-github-bot
Copy link
Contributor

@mrshenli merged this pull request in e1a309a.

csarofeen pushed a commit to mruberry/pytorch that referenced this pull request Nov 18, 2019
)

Summary:
Pull Request resolved: pytorch#29781

Even though the request might not contain any requires_grad tensor,
the return value could. Therefore, we should always include the
autograd context id in the request.

closes pytorch#28819

Test Plan: Imported from OSS

Differential Revision: D18496709

Pulled By: mrshenli

fbshipit-source-id: 2f870c410291a1300952895b7488ea07e5574228
@facebook-github-bot facebook-github-bot deleted the gh/mrshenli/51/head branch November 18, 2019 15:16
rohan-varma added a commit that referenced this pull request Dec 7, 2019
…ensure they are

recorded in all cases"


When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Dec 7, 2019
… are

recorded in all cases

Pull Request resolved: #30914

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.
ghstack-source-id: 95144641

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)
rohan-varma added a commit that referenced this pull request Dec 7, 2019
…ensure they are

recorded in all cases"

recorded in all cases**

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Dec 7, 2019
…ensure they are

recorded in all cases"

recorded in all cases**
recorded in all cases**

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Dec 7, 2019
… are

recorded in all cases

Pull Request resolved: #30914

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.
ghstack-source-id: 95158324

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)
rohan-varma added a commit that referenced this pull request Dec 8, 2019
…ensure they are

recorded in all cases"

recorded in all cases**

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Dec 8, 2019
… are

recorded in all cases

Pull Request resolved: #30914

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.
ghstack-source-id: 95166853

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)
rohan-varma added a commit that referenced this pull request Dec 9, 2019
…ensure they are

recorded in all cases"

recorded in all cases**
recorded in all cases**

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Dec 9, 2019
… are

recorded in all cases

Pull Request resolved: #30914

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.
ghstack-source-id: 95178561

Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)
facebook-github-bot pushed a commit that referenced this pull request Dec 9, 2019
…30914)

Summary:
Pull Request resolved: #30914

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.
ghstack-source-id: 95178561

Test Plan: Added a unit test: `test_context_cleanup_tensor_no_grad`

Differential Revision: D18869191

fbshipit-source-id: b80f66bfd0dd7d01960abe1691d3f44095bb1b2b
xxtEchjovs44 pushed a commit to xxtEchjovs44/pytorch that referenced this pull request Jan 29, 2020
Even though the request might not contain any requires_grad tensor,
the return value could. Therefore, we should always include the
autograd context id in the request.

ghstack-source-id: 75cd6bf
Pull Request resolved: pytorch/pytorch#29781
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
…ytorch#30914)

Summary:
Pull Request resolved: pytorch#30914

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  pytorch#29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds`  call to the `getMessageWithAutograd` function.
ghstack-source-id: 95178561

Test Plan: Added a unit test: `test_context_cleanup_tensor_no_grad`

Differential Revision: D18869191

fbshipit-source-id: b80f66bfd0dd7d01960abe1691d3f44095bb1b2b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants