Simple distributed optimizer #28910
Conversation
Implements a simple Python distributed optimizer that takes RRefs to the parameters that will be optimized. It keeps optimizer instances on the remote workers, and calling step on the distributed optimizer calls step on each of the remote optimizers in parallel. Differential Revision: [D18230877](https://our.internmc.facebook.com/intern/diff/D18230877/) ghstack-source-id: 92927594 Pull Request resolved: #28910
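To make the flow concrete, here is a rough, non-authoritative usage sketch assembled from the test snippets quoted later in this conversation. `remote_meth`, `rpc_async_meth`, `MyModule` and `remote_module1`/`remote_module2` are helpers defined in test/dist_optimizer_test.py, and the module path and backward()/step() signatures reflect this draft of the PR rather than any final API.

```python
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.optimizer.dist_optimizer as dist_optimizer

with dist_autograd.context() as context_id:
    # RRefs to parameters that live on remote workers (helpers from the test file).
    remote_param1 = remote_meth(MyModule.get_w, remote_module1)
    remote_param2 = remote_meth(MyModule.get_w, remote_module2)

    # One local optimizer instance is created remotely on each worker that
    # owns some of these parameters.
    dist_optim = dist_optimizer.DistributedOptimizer(
        dist_optimizer.FunctionalSGD, [remote_param1, remote_param2])

    t = torch.rand((3, 3), requires_grad=True)
    output = rpc_async_meth(MyModule.forward, remote_module1, t).wait()

    # At the time of this PR, backward() picks up the current context implicitly.
    dist_autograd.backward([output.sum()])
    # step() calls step() on each remote optimizer in parallel.
    dist_optim.step(context_id)
```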
@pytorchbot retest this please
test/dist_optimizer_test.py
    raise ValueError('Error running optimizer.')

def _call_meth(meth, obj_rref, *args, **kwargs):
nit: Can we use method/func instead of meth? :)
test/dist_optimizer_test.py
def rpc_async_meth(meth, obj_rref, *args, **kwargs):
    """
    Call rpc.remote on a method in a remote object.
nit: Fix the docs; it's the same as remote_meth above. I believe it should be rpc_async here.
test/dist_optimizer_test.py
    @dist_init()
    def test_dist_optim(self):
        if self.rank != 0:
Why are we running this only on one node?
No specific reason, will remove.
import unittest

@unittest.skipIf(TEST_WITH_ASAN, "Skip ASAN as torch + multiprocessing spawn have known issues")
We should have a test_dist_optimizer_fork.py file to ensure we still run ASAN, or we should run opt-asan for spawn.
from collections import defaultdict

class FunctionalOptimizer:
    """Base class for functional optimizers.
nit: Start doc on next line:
"""
Base class
        raise NotImplementedError

class FunctionalSGD(FunctionalOptimizer):
Can we put this class in a separate file?
        args: arguments to pass to the optimizer constructor on each worker.
        kwargs: arguments to pass to the optimizer constructor on each worker.
Why don't we pass an instance of the optimizer instead of the class and its args? We could create an RRef of this instance on the remote node by passing it in as a parameter to a remote call and returning the same object back.
This won't be possible: optimizer objects (both torch.optim and the FunctionalOptimizer that I introduced above) take local model parameters as input to the constructor. However, we don't have access to those parameters from here. The worker where the parameters live is the only one capable of passing those parameters to the constructor of the optimizer.
Alternatively, we could introduce an OptimizerConfig class, but it wouldn't solve the underlying issue.
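A minimal sketch of the construction pattern being described, assuming an illustrative helper `_new_local_optimizer`; the free variables (`worker_name`, `param_rrefs_on_worker`, `optimizer_class`, `args`, `kwargs`) stand in for whatever the real constructor loops over, and the actual code in the diff may differ.

```python
import torch.distributed.rpc as rpc

def _new_local_optimizer(optim_cls, local_params_rref, *args, **kwargs):
    # Runs on the worker that owns the parameters: only there can the RRefs
    # be dereferenced and real tensors handed to the optimizer constructor.
    local_params = [rref.local_value() for rref in local_params_rref]
    return optim_cls(local_params, *args, **kwargs)

# On the caller, for each worker that owns a subset of the parameter RRefs:
remote_optim = rpc.remote(
    worker_name,                       # owner of param_rrefs_on_worker
    _new_local_optimizer,
    args=(optimizer_class, param_rrefs_on_worker) + args,
    kwargs=kwargs,
)
```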
            self.remote_optimizers.append(remote_optim)

    def step(self, autograd_ctx_id):
Should we keep this API in line with backward, where we implicitly use the current context id instead of passing it in?
The issue is that dist_autograd._current_context() is private. Should we expose it publicly? (In a subsequent PR).
Can we keep it private and still call it from here? There is no reason to expose this method to end users at the moment.
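For reference, step() with an explicit context id fans out roughly like the sketch below. `_local_optimizer_step` is an illustrative helper (not necessarily the name in the diff), `dist_autograd.get_gradients` is the API that maps a context id back to recorded gradients, and `_wait_for_all` is the helper quoted further down.

```python
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

def _local_optimizer_step(local_optim_rref, autograd_ctx_id):
    # Runs on the owning worker: look up the gradients recorded under this
    # distributed autograd context and hand them to the local optimizer.
    local_optim = local_optim_rref.local_value()
    grads = dist_autograd.get_gradients(autograd_ctx_id)
    local_optim.step([grads.get(p) for p in local_optim.params])

def step(self, autograd_ctx_id):
    # One async RPC per remote optimizer, executed in parallel.
    futs = [
        rpc.rpc_async(rref.owner(), _local_optimizer_step,
                      args=(rref, autograd_ctx_id))
        for rref in self.remote_optimizers
    ]
    _wait_for_all(futs)
```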
            try:
                fut.wait()
            except Exception as e:
                exception = e
What is the reason that we continue waiting for other futures if we see an exception? Shouldn't we just exit as soon as possible and not wait for other futures?
I'm just not sure what would happen if one of these operations is still executing remotely and we exit the current autograd context.
def _wait_for_all(rpc_futs):
    # TODO: improve error propagation
What sort of improvements do we have in mind here? Also, can we create a GitHub issue for this so we can keep track of it in the future?
Ideally we want to gather all exceptions in a list. I'll open an issue for it.
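A sketch of the improvement being discussed: still wait on every future (so nothing is left running when the autograd context exits), but collect every exception instead of keeping only the last one. The combined error type and message format are assumptions.

```python
def _wait_for_all(rpc_futs):
    # Wait for every future, even after a failure, and report all errors.
    exceptions = []
    for fut in rpc_futs:
        try:
            fut.wait()
        except Exception as e:
            exceptions.append(e)
    if exceptions:
        raise RuntimeError(
            "{} remote optimizer call(s) failed: {}".format(
                len(exceptions), "; ".join(str(e) for e in exceptions)))
```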
test/dist_optimizer_test.py
from dist_utils import INIT_METHOD_TEMPLATE, dist_init
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.optimizer.dist_optimizer as dist_optimizer
The corresponding package and class are called torch.optim.Optimizer; do we want to keep the same convention here, i.e., torch.distributed.optim.Optimizer?
test/dist_optimizer_test.py
@unittest.skipIf(
    not torch._six.PY3, "Pytorch distributed autograd package " "does not support python2"
There are two unnecessary double quotes here, or maybe you intended to break this into shorter lines.
test/dist_optimizer_test.py
        remote_param1 = remote_meth(MyModule.get_w, remote_module1)
        remote_param2 = remote_meth(MyModule.get_w, remote_module2)

        dst_optim = dist_optimizer.DistributedOptimizer(
dst_optim -> dist_optim?
        t2 = torch.rand((3, 3), requires_grad=True)
        output1 = rpc_async_meth(MyModule.forward, remote_module1, t2)
        output2 = rpc_async_meth(
            MyModule.forward, remote_module2, output1.wait())
In followup PRs, we also want to test passing an RRef of output1 to remote_module2, right?
Correct, I was waiting for the .to_here() autograd propagation PR to land. If it already has, I can add the test to this PR.
It's landed, but we can still add new tests in a PR on top of this one. There is some flakiness that we are investigating, which should not block us from landing this PR in its current form.
    def __init__(self, params):
        self.params = params

    def step(self, gradients):
Any reason for directly taking the gradients here instead of a distributed autograd context id, and using that id to retrieve the gradients?
Notice that this class is a local optimizer; it doesn't know anything about the distributed machinery.
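A minimal sketch of such a local, functional optimizer: it never reads param.grad or any distributed state, and simply applies whatever gradients the caller hands it, aligned with self.params. The lr default and update rule here are illustrative, not the PR's exact code.

```python
import torch

class FunctionalOptimizer:
    # Base class as quoted above: holds the local params, defines step().
    def __init__(self, params):
        self.params = params

    def step(self, gradients):
        raise NotImplementedError

class FunctionalSGD(FunctionalOptimizer):
    def __init__(self, params, lr=0.01):
        super().__init__(params)
        self.lr = lr

    def step(self, gradients):
        # gradients must be aligned with self.params, as the docstring notes.
        with torch.no_grad():
            for param, grad in zip(self.params, gradients):
                if grad is not None:
                    param.add_(grad, alpha=-self.lr)
```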
from collections import defaultdict

class FunctionalOptimizer:
Would I be correct to assume there is no easy way to reuse existing optimizers without modifying them, since they directly read from param.grad?
I am thinking of writing an adapter that allows using the existing optim.Optimizer classes. In order to avoid race conditions, I'll have to create multiple instances of optim.Optimizer that take the same parameters but don't share the gradients. I can open an issue explaining the idea in more detail.
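A purely illustrative sketch of the "same parameters, unshared gradients" idea: each optimizer instance gets detached views that share storage with the original parameters (so in-place updates still apply to them) but keep their own independent .grad. This is not code from this PR.

```python
import torch

def unshared_param_views(params):
    views = []
    for p in params:
        v = p.detach()           # shares storage with p, so in-place updates hit p
        v.requires_grad_(True)   # but .grad is independent of p.grad
        views.append(v)
    return views

params = [torch.rand(3, 3, requires_grad=True)]
# Two optimizer instances over the same underlying storage, without racing
# on a shared .grad field.
opt_a = torch.optim.SGD(unshared_param_views(params), lr=0.05)
opt_b = torch.optim.SGD(unshared_param_views(params), lr=0.05)
```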
        specific parameters.

    Args:
        optimizer_class (FunctionalOptimizer): the class of optimizer to
Can the constructor take a torch.optim class as input instead?
I also would prefer that way. In the hogwild case there could be a race in the proposal above (I'm not sure if that matters though).
This probably won't work in hogwild without a lock, as grad might be overwritten by other threads. But since hogwild is a more advanced use case, it seems reasonable to ask hogwild applications to implement their own local optimizers that do not directly read from param.grad? We could have a mode flag here to toggle whether we want to use a lock.
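A hypothetical sketch of the lock-toggle idea: wrap an existing torch.optim class, write the supplied gradients into param.grad, and call step(), optionally under a lock for hogwild-style concurrent callers. The wrapper name and use_lock flag are assumptions, not this PR's API.

```python
import contextlib
import threading
import torch

class LocalOptimWrapper:
    def __init__(self, optim_cls, params, *args, use_lock=True, **kwargs):
        self.params = params
        self.optim = optim_cls(params, *args, **kwargs)
        self.lock = threading.Lock() if use_lock else contextlib.nullcontext()

    def step(self, gradients):
        # Serialize the .grad writes and step() when the lock is enabled, so
        # concurrent (hogwild-style) callers don't overwrite each other.
        with self.lock:
            for param, grad in zip(self.params, gradients):
                param.grad = grad
            self.optim.step()

params = [torch.rand(3, 3, requires_grad=True)]
opt = LocalOptimWrapper(torch.optim.SGD, params, lr=0.05)
opt.step([torch.ones(3, 3)])
```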
    Args:
        optimizer_class (FunctionalOptimizer): the class of optimizer to
            instantiate on each worker.
        params_rref (list[RRef]): list of RRefs to local or remote parameters
Would I be correct if I assume that the RemoteModule we discussed offline should have a parameters() API that returns a list of RRefs?
Correct. I didn't want to introduce a full-fledged RemoteModule in this diff; also, we'll need a prep PR before we can return a list of RRefs.
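A hypothetical sketch of what such a parameters()-style API could look like on the remote side; the method name is illustrative, RPC must already be initialized on the worker, and the real RemoteModule came in later PRs.

```python
import torch
import torch.nn as nn
from torch.distributed.rpc import RRef

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.rand(3, 3))

    def forward(self, t):
        return torch.mm(self.w, t)

    def parameter_rrefs(self):
        # Wrap each local parameter in an RRef so a remote caller can feed
        # the whole list to DistributedOptimizer.  Requires rpc.init_rpc()
        # to have been called on this worker.
        return [RRef(p) for p in self.parameters()]
```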
Hi, I was tagged as a reviewer here. It looks like there are already reviews. Is there anything specific you want me to look at?
@ezyang I tagged you as a reviewer since this is somewhat related to distributed autograd, although it looks like @vincentqb and @soumith are the owners for torch.optim based on this: https://pytorch.org/docs/stable/community/persons_of_interest.html#torch-optim
Yes, why don't you tag @vincentqb for review here, for optimizer notes.
Hey @vincentqb, could you please help take a look at the optimizer API in this PR?
What was the conclusion of this discussion? As mentioned above, it would be really nice to have a mechanism to re-use the current optimizers. Wrappers offering locking and non-locking variants could be nice, but a little heavy to use. Offering a default through a toggle may be a solution if we don't expect more toggles to appear over time. Also, it'd be nice to keep the syntax as close as possible to the existing optimizers, so a local optimizer can simply be swapped for the distributed one. Thoughts?
class FunctionalSGD(FunctionalOptimizer):
    """Simplistic implementation of Stocastic Gradient Descent optimizer.
nit: Stochastic
    def __init__(self, params):
        self.params = params

    def step(self, gradients):
The current Optimizer's step takes a closure and returns the loss. What would be an equivalent here?
Unfortunately there's no straightforward way of porting the closure() functionality to distributed, as it would potentially require algorithm-specific synchronization across workers. This would be true for L-BFGS, for example, where closure() and parameter updates are interleaved in a loop inside step().
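For reference, this is the closure pattern in torch.optim being discussed; the model, inputs, and loss here are just a toy local example.

```python
import torch

model = torch.nn.Linear(3, 1)
inputs = torch.rand(8, 3)
optimizer = torch.optim.LBFGS(model.parameters())

def closure():
    # step() may re-evaluate this several times per update; that interleaving
    # is what has no clean distributed equivalent, per the reply above.
    optimizer.zero_grad()
    loss = model(inputs).pow(2).sum()
    loss.backward()
    return loss

loss = optimizer.step(closure)
```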
        matters as the list of gradients passed to the step
        function must be aligned with this list.
    """
    def __init__(self, params):
Would having "defaults" make sense here to mimic Optimizer?
            self.remote_optimizers.append(remote_optim)

    def step(self, autograd_ctx_id):
Same comment as FunctionalOptimizer.step
from collections import defaultdict

class FunctionalOptimizer:
Is this precise enough as a name? Or could we name it differently, say LocalOptimizer or NonLockingOptimizer?
Proposal:
Let me know if that works.
@vincentqb could you comment on this proposal, since we'd like to have this PR landed soon for the 1.4 release?
Synced up offline with @mrshenli: instead, I'll go the simpler route of integrating with optim.Optimizer directly to avoid introducing new APIs.
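From the caller's side, that simpler route looks roughly like the sketch below: pass an existing torch.optim class plus its constructor arguments and let the wrapper instantiate it remotely. The torch.distributed.optim module path follows the naming convention discussed earlier, and params_rref stands for a list of parameter RRefs as above.

```python
import torch.optim as optim
from torch.distributed.optim import DistributedOptimizer

# Reuse a plain torch.optim class; the wrapper constructs one instance
# remotely per worker that owns parameters, forwarding lr etc. to it.
dist_optim = DistributedOptimizer(
    optim.SGD,
    params_rref,
    lr=0.05,
)
```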
@aazzolini, did you mean to close this PR?
The proposal, with the amendment, sounds good to me. Thanks for working on this!
Pull Request resolved: #28910 Implements a simple Python distributed optimizer that takes RRefs to the parameters that will be optimized. It keeps optimizer instances on the remote workers, and calling step on the distributed optimizer calls step on each of the remote optimizers in parallel. ghstack-source-id: 93377263 Differential Revision: [D18230877](https://our.internmc.facebook.com/intern/diff/D18230877/)
This is weird, I must have closed this by mistake. It won't let me re-open it, saying the commits have already been merged, but that doesn't seem to be the case.
@vincentqb @pritamdamania87 @mrshenli I opened #29304 to continue the discussion.
Stack from ghstack:
Implements a simple Python distributed optimizer that takes RRefs to the parameters that will be optimized.
It keeps optimizer instances on the remote workers, and calling step on the distributed optimizer calls step on each of the remote optimizers in parallel.
Differential Revision: D18230877