Always include autograd context id in rpc/remote requests #29781

mrshenli · 2019-11-14T01:02:34Z

Stack from ghstack:

[WIP] Add end-to-end test for RNN Module #29543
Always include autograd context id in rpc/remote requests #29781
Allow rpc.remote to create RRef on self #29634

Even though the request might not contain any requires_grad tensor,
the return value could. Therefore, we should always include the
autograd context id in the request.
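
To make the reasoning above concrete, here is a minimal pure-Python sketch of the decision this change affects. It does not use the real `torch.distributed.rpc` internals: `Tensor`, `Request`, `build_request`, `callee`, and `can_link_backward` are hypothetical stand-ins, and the `always_attach` flag only exists here to contrast the old and new behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tensor:
    """Hypothetical stand-in for a tensor; it only tracks requires_grad."""
    requires_grad: bool = False


@dataclass
class Request:
    """Hypothetical RPC request: arguments plus an optional context id."""
    args: List[Tensor]
    autograd_context_id: Optional[int]


def build_request(args, context_id, always_attach):
    """Attach the context id either always (new behavior) or only when
    some argument requires grad (old behavior)."""
    needs_grad = any(t.requires_grad for t in args)
    attach = always_attach or needs_grad
    return Request(args, context_id if attach else None)


def callee(request):
    """Remote function whose return value requires grad regardless of its
    inputs, e.g. because it uses a tensor owned by the callee."""
    return Tensor(requires_grad=True)


def can_link_backward(request, response):
    """The response can join the caller's backward pass only if the callee
    was told which autograd context the call belongs to."""
    return response.requires_grad and request.autograd_context_id is not None


args = [Tensor(requires_grad=False)]
old = build_request(args, context_id=7, always_attach=False)
new = build_request(args, context_id=7, always_attach=True)
print(can_link_backward(old, callee(old)))  # False
print(can_link_backward(new, callee(new)))  # True
```

In this toy model the old path drops the context id because no argument requires grad, so the grad-requiring return value has no context to attach to.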

closes #28819

Differential Revision: D18496709

pritamdamania87 · 2019-11-14T01:46:58Z

test/dist_autograd_test.py

+            ctx = dist_autograd._current_context()
+            send_functions = ctx._send_functions()
+            self.assertEqual(len(send_functions), 0)
+            recv_functions = ctx._recv_functions()
+            self.assertEqual(len(recv_functions), 1)


Instead of checking the autograd functions, can we run the backward pass and verify that the gradient is accumulated correctly on the remote tensor? We might need to return a global tensor from ret_requires_grad to validate that the grads are accumulated for ExecMode.RPC_SYNC.

Yes, let me add that.

pritamdamania87

Thanks for fixing this! Just have one comment inline for the unit test.

mrshenli · 2019-11-14T23:57:06Z

The lint failure is on a file that I didn't touch:

/home/vsts/work/1/s/aten/src/ATen/cuda/ATenCUDAGeneral.h:3:10: error: 'cuda.h' file not found [clang-diagnostic-error]

facebook-github-bot · 2019-11-15T08:08:35Z

@mrshenli merged this pull request in e1a309a.

) Summary: Pull Request resolved: pytorch#29781

Even though the request might not contain any requires_grad tensor, the return value could. Therefore, we should always include the autograd context id in the request.

closes pytorch#28819

Test Plan: Imported from OSS
Differential Revision: D18496709
Pulled By: mrshenli
fbshipit-source-id: 2f870c410291a1300952895b7488ea07e5574228

…30914) Summary: Pull Request resolved: #30914

When tensors don't require grad, we don't call `addSendRpcBackward`, which is where we record the known worker IDs used to clean up the dist autograd context later. But since #29781, we always include the autograd context ID in RPCs, even if the tensors do not require grad. As a result, the contexts may never be released on some nodes. This can contribute to OOMs, since the contexts are not cleaned up in this case; this can be checked by running the unit test without this patch. We can fix the issue by moving the `addKnownWorkerIds` call to the `getMessageWithAutograd` function.

ghstack-source-id: 95178561
Test Plan: Added a unit test: `test_context_cleanup_tensor_no_grad`
Differential Revision: D18869191
fbshipit-source-id: b80f66bfd0dd7d01960abe1691d3f44095bb1b2b
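
The bookkeeping problem described in this commit message can be sketched with a small pure-Python model. This is an illustration under stated assumptions, not the actual C++ implementation: `Worker`, `CallerContext`, `send`, `release`, and `leaked_after_call` are hypothetical names, and the `record_always` flag stands in for where the worker ID gets recorded (only on the requires-grad path before the fix, on every send after it).

```python
class Worker:
    """Hypothetical remote worker that holds state per autograd context id."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.live_contexts = set()


class CallerContext:
    """Hypothetical caller-side context that tracks which workers it touched."""

    def __init__(self, context_id):
        self.context_id = context_id
        self.known_worker_ids = set()


def send(ctx, dst, tensors_require_grad, record_always):
    # The context id always travels with the request (behavior since #29781),
    # so the destination always ends up holding state for it.
    dst.live_contexts.add(ctx.context_id)
    # Before the fix, the destination was recorded only when tensors required
    # grad; after the fix it is recorded on every send.
    if record_always or tensors_require_grad:
        ctx.known_worker_ids.add(dst.worker_id)


def release(ctx, workers):
    # Cleanup only reaches the workers the caller has recorded.
    for worker_id in ctx.known_worker_ids:
        workers[worker_id].live_contexts.discard(ctx.context_id)


def leaked_after_call(record_always):
    """Send one no-grad request, release the context, and report what is left."""
    workers = {1: Worker(1)}
    ctx = CallerContext(context_id=42)
    send(ctx, workers[1], tensors_require_grad=False, record_always=record_always)
    release(ctx, workers)
    return sorted(workers[1].live_contexts)


print(leaked_after_call(record_always=False))  # [42]
print(leaked_after_call(record_always=True))   # []
```

With recording tied to the requires-grad path, the caller never learns that worker 1 holds context 42, so the release step skips it; recording on every send closes that gap in this model.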

Always include autograd context id in rpc/remote requests

cfeaf99

mrshenli requested a review from pietern as a code owner November 14, 2019 01:02

This was referenced Nov 14, 2019

Allow rpc.remote to create RRef on self #29634

Closed

[WIP] Add end-to-end test for RNN Module #29543

Closed

Update on "Always include autograd context id in rpc/remote requests"

5a1eb7f

mrshenli requested a review from pritamdamania87 November 14, 2019 01:18

pritamdamania87 reviewed Nov 14, 2019

pritamdamania87 suggested changes Nov 14, 2019

mrshenli added 2 commits November 13, 2019 22:00

pritamdamania87 approved these changes Nov 14, 2019

pritamdamania87 reviewed Nov 14, 2019

facebook-github-bot closed this in e1a309a Nov 15, 2019

facebook-github-bot added the merged label Nov 15, 2019

facebook-github-bot deleted the gh/mrshenli/51/head branch November 18, 2019 15:16

rohan-varma mentioned this pull request Dec 7, 2019

[rpc] add the worker IDs outside of addSendRpcBackward to ensure they are recorded in all cases #30914

Closed

mruberry added the Merged label Oct 28, 2020
