-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Always include autograd context id in rpc/remote requests #29781
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Even though the request might not contain any requires_grad tensor, the return value could. Therefore, we should always include the autograd context id in the request. [ghstack-poisoned]
Even though the request might not contain any requires_grad tensor, the return value could. Therefore, we should always include the autograd context id in the request. closes #28819 [ghstack-poisoned]
Even though the request might not contain any requires_grad tensor, the return value could. Therefore, we should always include the autograd context id in the request. closes #28819 Differential Revision: [D18496709](https://our.internmc.facebook.com/intern/diff/D18496709) [ghstack-poisoned]
test/dist_autograd_test.py
Outdated
| ctx = dist_autograd._current_context() | ||
| send_functions = ctx._send_functions() | ||
| self.assertEqual(len(send_functions), 0) | ||
| recv_functions = ctx._recv_functions() | ||
| self.assertEqual(len(recv_functions), 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of checking the autograd functions, can we run the backward pass and verify the gradient is accumulated correctly on the remote tensor? We probably might need to return a global tensor from ret_requires_grad to validate the grads are accumulated for ExecMode.RPC_SYNC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let me add that.
pritamdamania87
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this! Just have one comment inline for the unit test.
Even though the request might not contain any requires_grad tensor, the return value could. Therefore, we should always include the autograd context id in the request. closes #28819 Differential Revision: [D18496709](https://our.internmc.facebook.com/intern/diff/D18496709) [ghstack-poisoned]
Even though the request might not contain any requires_grad tensor, the return value could. Therefore, we should always include the autograd context id in the request. closes #28819 Differential Revision: [D18496709](https://our.internmc.facebook.com/intern/diff/D18496709) [ghstack-poisoned]
Even though the request might not contain any requires_grad tensor, the return value could. Therefore, we should always include the autograd context id in the request. closes #28819 Differential Revision: [D18496709](https://our.internmc.facebook.com/intern/diff/D18496709) [ghstack-poisoned]
|
The lint failure is on a file that I didn't touch: |
) Summary: Pull Request resolved: pytorch#29781 Even though the request might not contain any requires_grad tensor, the return value could. Therefore, we should always include the autograd context id in the request. closes pytorch#28819 Test Plan: Imported from OSS Differential Revision: D18496709 Pulled By: mrshenli fbshipit-source-id: 2f870c410291a1300952895b7488ea07e5574228
…ensure they are recorded in all cases" When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes. This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds` call to the `getMessageWithAutograd` function. Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/) [ghstack-poisoned]
… are recorded in all cases Pull Request resolved: #30914 When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes. This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds` call to the `getMessageWithAutograd` function. ghstack-source-id: 95144641 Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)
…ensure they are recorded in all cases" recorded in all cases** When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes. This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds` call to the `getMessageWithAutograd` function. Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/) [ghstack-poisoned]
…ensure they are recorded in all cases" recorded in all cases** recorded in all cases** When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes. This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds` call to the `getMessageWithAutograd` function. Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/) [ghstack-poisoned]
… are recorded in all cases Pull Request resolved: #30914 When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes. This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds` call to the `getMessageWithAutograd` function. ghstack-source-id: 95158324 Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)
…ensure they are recorded in all cases" recorded in all cases** When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes. This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds` call to the `getMessageWithAutograd` function. Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/) [ghstack-poisoned]
… are recorded in all cases Pull Request resolved: #30914 When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes. This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds` call to the `getMessageWithAutograd` function. ghstack-source-id: 95166853 Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)
…ensure they are recorded in all cases" recorded in all cases** recorded in all cases** When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes. This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds` call to the `getMessageWithAutograd` function. Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/) [ghstack-poisoned]
… are recorded in all cases Pull Request resolved: #30914 When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes. This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds` call to the `getMessageWithAutograd` function. ghstack-source-id: 95178561 Differential Revision: [D18869191](https://our.internmc.facebook.com/intern/diff/D18869191/)
…30914) Summary: Pull Request resolved: #30914 When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since #29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes. This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds` call to the `getMessageWithAutograd` function. ghstack-source-id: 95178561 Test Plan: Added a unit test: `test_context_cleanup_tensor_no_grad` Differential Revision: D18869191 fbshipit-source-id: b80f66bfd0dd7d01960abe1691d3f44095bb1b2b
Even though the request might not contain any requires_grad tensor, the return value could. Therefore, we should always include the autograd context id in the request. ghstack-source-id: 75cd6bf Pull Request resolved: pytorch/pytorch#29781
…ytorch#30914) Summary: Pull Request resolved: pytorch#30914 When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since pytorch#29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes. This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checking by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds` call to the `getMessageWithAutograd` function. ghstack-source-id: 95178561 Test Plan: Added a unit test: `test_context_cleanup_tensor_no_grad` Differential Revision: D18869191 fbshipit-source-id: b80f66bfd0dd7d01960abe1691d3f44095bb1b2b
Stack from ghstack:
Even though the request might not contain any requires_grad tensor,
the return value could. Therefore, we should always include the
autograd context id in the request.
closes #28819
Differential Revision: D18496709