[WIP] Add end-to-end test for RNN Module #29543
Conversation
The test puts the encoder and decoder on a remote worker and keeps the LSTM module local. The forward pass first looks up the embedding remotely, then fetches the embedding result and runs it through the local LSTM module, and finally sends the output to the remote decoder. The backward pass should automatically traverse all involved parties. The optimizer takes a list of parameter RRefs and reaches each owner to update the parameters. [ghstack-poisoned]
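To make the described flow concrete, here is a minimal sketch of what such a test could look like. The helper functions, module choices, worker name `"ps"`, and hyper-parameters are illustrative assumptions, not the code in this PR; the `dist_autograd.backward([...])` and `opt.step()` calls follow the API as it appears in the diffs further down.

```python
# Sketch only: assumes rpc.init_rpc(...) has already been called on every
# worker and that a worker named "ps" exists. All names are made up for
# illustration; see the actual test diff in this PR for the real code.
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch.distributed.optim import DistributedOptimizer


def _call_method(method, rref, *args, **kwargs):
    # Runs `method` on the object owned by `rref`; executes on the owner.
    return method(rref.local_value(), *args, **kwargs)


def _remote_method(method, rref, *args, **kwargs):
    # Synchronously invokes `method` on the owner of `rref`.
    return rpc.rpc_sync(rref.owner(), _call_method,
                        args=[method, rref] + list(args), kwargs=kwargs)


def _param_rrefs(module_rref):
    # Executes on the owner: wraps each parameter in an RRef for the optimizer.
    return [rpc.RRef(p) for p in module_rref.local_value().parameters()]


class RNNModel(nn.Module):
    def __init__(self, ps, ntoken, ninp, nhid, nlayers):
        super().__init__()
        # encoder (embedding) and decoder live on the remote worker `ps`
        self.emb_rref = rpc.remote(ps, nn.Embedding, args=(ntoken, ninp))
        self.dec_rref = rpc.remote(ps, nn.Linear, args=(nhid, ntoken))
        # the LSTM stays on the local trainer
        self.rnn = nn.LSTM(ninp, nhid, nlayers)

    def forward(self, inp, hidden):
        emb = _remote_method(nn.Embedding.forward, self.emb_rref, inp)
        out, hidden = self.rnn(emb, hidden)
        return _remote_method(nn.Linear.forward, self.dec_rref, out), hidden

    def parameter_rrefs(self):
        rrefs = rpc.rpc_sync(self.emb_rref.owner(), _param_rrefs, args=(self.emb_rref,))
        rrefs += rpc.rpc_sync(self.dec_rref.owner(), _param_rrefs, args=(self.dec_rref,))
        rrefs += [rpc.RRef(p) for p in self.rnn.parameters()]
        return rrefs


# Trainer-side loop: the distributed autograd context ties the forward pass,
# the cross-worker backward pass, and the optimizer step together.
ntoken, ninp, nhid, nlayers, seq_len, batch = 10, 2, 3, 4, 6, 5
rnn = RNNModel("ps", ntoken, ninp, nhid, nlayers)
opt = DistributedOptimizer(optim.SGD, rnn.parameter_rrefs(), lr=0.05)
hidden = (torch.zeros(nlayers, batch, nhid), torch.zeros(nlayers, batch, nhid))
with dist_autograd.context() as ctx_id:
    inp = torch.randint(0, ntoken, (seq_len, batch))
    output, hidden = rnn(inp, hidden)
    dist_autograd.backward([output.sum()])  # API as used in this PR's diff
    opt.step()                              # reaches each parameter's owner
```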
    agent,
    dst,
    std::move(*pythonCall).toMessage(),
    true /*forceGradRecording*/);
Adding forceGradRecording because a UDF's arguments might not carry any requires_grad tensors, while its return value does require grad.
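A minimal local illustration of this situation (a hypothetical example, not the RPC code path itself): the call's inputs carry no tensor that requires grad, yet the value the function returns does, so grad recording cannot be inferred from the arguments alone.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 3)       # the module's weight requires grad
idx = torch.tensor([1, 2, 4])   # integer indices: requires_grad is False
out = emb(idx)                  # what a remote UDF body might compute

print(idx.requires_grad, out.requires_grad)  # False True
```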
IIUC, this fixes #28819. If so, can we have a simple unit test for the problem this fixes?
Yes, actually let me split this PR into two, with the first one focusing on fixing #28819 with a unit test.
test/dist_model_parallel_test.py
Outdated
rnn = RNNModel(ps, ntoken, ninp, nhid, nlayers)
# Depends on #29304 and #28948
# opt = DistributedOptimizer(
#     optim.SGD,
#     rnn.remote_parameters(),
#     lr=0.05,
# )
with dist_autograd.context() as ctx_id:
    inp = torch.LongTensor(batch, nindices) % ntoken
    output, hidden = rnn(inp, hidden)
    dist_autograd.backward([output.sum()])
    # opt.step()
test/dist_model_parallel_test.py
Outdated
    args=[RemoteModule.remote_parameters, remote_module_rref]
)


class Encoder(RemoteModule):
nit, should this be called something like EmbeddingLayer or RemoteEmbeddingLayer? Usually in seq2seq models encoder refers to the RNN that creates the encoded representation, which in this case would be the LSTM.
# from torch.distributed.optim import DistributedOptimizer
# from torch.distributed.rpc import RRef


from dist_utils import INIT_METHOD_TEMPLATE, dist_init, TEST_CONFIG
Looks like we have a bunch of lint failures.
@@ -0,0 +1,138 @@
import torch
Wasn't the plan to have this in the examples repository? Why is this a unit test instead?
We need RPC module tests as well. I would first like to check with @soumith on whether this is a reasonable example and whether there are any APIs we need to revise. We are also waiting for the dist optimizer to land (it's landing now).
@@ -0,0 +1,18 @@
#!/usr/bin/env python3
from __future__ import absolute_import, division, print_function, unicode_literals
Why do we have two separate files if we only have a spawn mode?
Let me merge this into one file.
  TORCH_INTERNAL_ASSERT(
      torch::autograd::compute_requires_grad(tensors),
      "Received tensors do not require grad, addRecvRpcBackward should not be called");
  if (!tensors.empty() && torch::autograd::compute_requires_grad(tensors)) {
Were the changes here and in python_functions.cpp bugs that were discovered by the unit test in this PR?
Yes, I hit errors when running the unit test in this PR, but that's expected since we already have #28819 to track this.
aazzolini left a comment
I believe if we do use RemoteModule, it should be a wrapper that lives on the "client" side, not on the "server" side. The way the example is presented here, I'd rather just not introduce the concept of RemoteModule and just call into nn.Module directly.
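For illustration, a rough sketch of what such a client-side wrapper could look like (class and helper names here are hypothetical, not an API proposed in this PR): the wrapped module stays an ordinary nn.Module on its owner, and only the client-side wrapper knows it is remote.

```python
import torch.nn as nn
import torch.distributed.rpc as rpc


def _remote_forward(module_rref, *args, **kwargs):
    # Executes on the owner of module_rref.
    return module_rref.local_value()(*args, **kwargs)


def _remote_param_rrefs(module_rref):
    # Executes on the owner; returns one RRef per parameter.
    return [rpc.RRef(p) for p in module_rref.local_value().parameters()]


class ClientSideRemoteModule(nn.Module):
    """Client-side wrapper around a plain nn.Module that lives on `on_worker`."""

    def __init__(self, on_worker, module_cls, *args, **kwargs):
        super().__init__()
        self.module_rref = rpc.remote(on_worker, module_cls, args=args, kwargs=kwargs)

    def forward(self, *args, **kwargs):
        return rpc.rpc_sync(self.module_rref.owner(), _remote_forward,
                            args=[self.module_rref] + list(args), kwargs=kwargs)

    def remote_parameters(self):
        return rpc.rpc_sync(self.module_rref.owner(), _remote_param_rrefs,
                            args=(self.module_rref,))


# e.g. the embedding layer could then be created as
#   encoder = ClientSideRemoteModule("ps", nn.Embedding, ntoken, ninp)
# without the server-side module inheriting from anything special.
```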
test/dist_model_parallel_test.py
Outdated
    kwargs=kwargs
)


class RemoteModule(nn.Module):
What is the reason for RemoteModule in here? I believe RemoteModule only makes sense as a wrapper module that wraps regular modules. The way it's exposed here you'll still have to implement a special module that inherits from RemoteModule, so there's not much utility for this class.
If the only functionality is exposing remote_parameters(), then simply make it a free function, e.g.:
def get_module_param_rrefs(module: nn.Module):
    return [rpc.RRef(param) for param in module.parameters()]
If the goal is to discover remote modules, you could have an RRef wrapper on the client side instead (not on the remote side).
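A hedged sketch of how the client could then collect parameter RRefs with such a free function, mirroring the rpc_sync call in the test diff below; it assumes the `get_module_param_rrefs` function above and the `remote_module_rref` from the diff, and the owner-side helper is hypothetical.

```python
import torch.optim as optim
import torch.distributed.rpc as rpc
from torch.distributed.optim import DistributedOptimizer


def _get_param_rrefs_on_owner(module_rref):
    # Runs on the module's owner and unwraps the RRef locally.
    return get_module_param_rrefs(module_rref.local_value())


param_rrefs = rpc.rpc_sync(remote_module_rref.owner(), _get_param_rrefs_on_owner,
                           args=(remote_module_rref,))
opt = DistributedOptimizer(optim.SGD, param_rrefs, lr=0.05)
```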
test/dist_model_parallel_test.py
Outdated
    args=[RemoteModule.remote_parameters, remote_module_rref]
)


class Encoder(RemoteModule):
RemoteModule here is a misnomer, this is a perfectly well defined local module actually. It just happens to be used as a remote module somewhere. Nothing in the implementation of this module implies remote.
test/dist_model_parallel_test.py
Outdated
        return self.drop(self.encoder(input))


class RNN(RemoteModule):
Same here
import unittest


@unittest.skipIf(TEST_WITH_ASAN, "Skip ASAN as torch + multiprocessing spawn have known issues")
class DistModelParallelSpawn(MultiProcessTestCase, DistModelParallelTest):
curious, is there a reason we don't have a fork mode test for this?
Are we still planning to merge this eventually? It would be super useful to have an end-to-end test to sanity check changes we make in the RPC layer.
Stack from ghstack:
The test puts the encoder and decoder on a remote worker and keeps the LSTM module local. The forward pass first looks up the embedding remotely, then fetches the embedding result and runs it through the local LSTM module, and finally sends the output to the remote decoder. The backward pass should automatically traverse all involved parties. The optimizer takes a list of parameter RRefs and reaches each owner to update the parameters.
Differential Revision: D18482428