Simple distributed optimizer #29304
Conversation
Implements a simple Python distributed optimizer that takes RRefs to the parameters that will be optimized. It keeps instances of the optimizer on the remote workers, and calling step() on the distributed optimizer calls step() on each of the remote optimizers in parallel. Differential Revision: [D18354586](https://our.internmc.facebook.com/intern/diff/D18354586/)
This PR replaces #28910, which was closed by mistake and couldn't be reopened. For this version, I got rid of FunctionalOptimizer and made torch.optim.Optimizer work directly with DistributedOptimizer, simplifying the implementation and reducing the new API footprint.
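For readers landing on this thread, here is a rough sketch of how a trainer might drive the optimizer described above. This is only an illustration, not this PR's API: `param_rrefs` and `loss_fn` are placeholders, the `dist_autograd` signatures follow later torch.distributed releases, and the exact signature of `DistributedOptimizer.step()` is not shown in the diff snippets in this thread.

```python
import torch.distributed.autograd as dist_autograd
import torch.optim as optim
from torch.distributed.optim import DistributedOptimizer


def make_dist_optimizer(param_rrefs):
    # Matches the constructor shown in the diff below:
    # DistributedOptimizer(optimizer_class, params_rref, *args, **kwargs).
    # Extra args/kwargs (here lr=0.05) are forwarded to every per-worker
    # optim.SGD instance.
    return DistributedOptimizer(optim.SGD, param_rrefs, lr=0.05)


def train_step(dist_optim, loss_fn):
    # Run forward/backward inside a distributed autograd context, then let the
    # distributed optimizer fan out step() to each remote _LocalOptimizer in
    # parallel, as described in the PR summary.
    with dist_autograd.context() as context_id:
        loss = loss_fn()
        dist_autograd.backward(context_id, [loss])
        dist_optim.step(context_id)
```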
torch/distributed/optim/optimizer.py
        kwargs: arguments to pass to the optimizer constructor on each worker.
    """
    def __init__(self, optimizer_class, params_rref, *args, **kwargs):
        per_worker_params_rref = defaultdict(lambda: [])
nit: defaultdict(list)
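For readers unfamiliar with the nit, a quick illustration of the equivalence (the worker name and the string standing in for an RRef are just placeholders):

```python
from collections import defaultdict

# defaultdict(list) is the idiomatic spelling of defaultdict(lambda: []):
# `list` itself is a zero-argument factory that produces a new empty list.
per_worker_params_rref = defaultdict(list)
per_worker_params_rref["worker1"].append("some_param_rref")
assert per_worker_params_rref["worker2"] == []  # missing keys default to []
```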
pietern left a comment
Nice! This is very simple indeed :)
test/dist_optimizer_test.py
if __name__ == '__main__':
    unittest.main()
If I'm not mistaken, this test cannot run by itself, so this should go.
            [rref.local_value().wait() for rref in local_params_rref],
            *args,
            **kwargs)
        self.lock = Lock()
This is the lock mentioned here, right? How would we modify this to get Hogwild?
Yes. There's no simple way of modifying the current implementation to get Hogwild. We'd have to either 1) change the interface (FunctionalOptimizer), or 2) use some alternative such as keeping an object pool or thread-local instances of optim.Optimizer to avoid gradient sharing across threads. I can write a comment on why the lock is there.
Cool, just wanted to verify. A quick comment would be a good idea indeed. Thanks!
Wouldn't removing this lock give us some form of Hogwild? I'm not sure how accurate this example is: https://pytorch.org/docs/stable/notes/multiprocessing.html#hogwild, but it does something similar: accumulating gradients on the tensors and running the optimizer can interleave across different processes.
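A note on why the lock is needed in the current design: `param.grad` is a single shared slot per parameter, so without the lock two RPC threads can interleave their `param.grad = grad` assignments and `self.optim.step()` calls, losing one gradient and applying the other twice. The interface-change option (1) in the reply above would sidestep this because gradients would never go through the shared `.grad` slot. A minimal sketch of what that could look like, assuming nothing beyond plain tensors (`FunctionalSGD` is a made-up name, not part of this PR):

```python
import torch


class FunctionalSGD:
    """Hypothetical sketch of the FunctionalOptimizer-style interface
    discussed above (not part of this PR).  Gradients are passed to step()
    explicitly instead of being read from the shared param.grad slot, so
    there is no per-parameter state to protect with a lock and concurrent,
    Hogwild-style updates become possible."""

    def __init__(self, params, lr=0.01):
        self.params = list(params)
        self.lr = lr

    def step(self, grads):
        # `grads` is a list of tensors aligned with self.params and owned by
        # the caller; updates may interleave across threads, which is exactly
        # what Hogwild tolerates.
        with torch.no_grad():
            for param, grad in zip(self.params, grads):
                param.add_(grad, alpha=-self.lr)
```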
What is the plan for adding a tutorial / example?
We need thorough docstrings for this one; then we can have a single tutorial for full training, I think.
        with self.lock:
            for param, grad in all_local_grads.items():
                param.grad = grad
            self.optim.step()
this is nice!
        params_rref (list[RRef]): list of RRefs to local or remote parameters
            to optimize.
        args: arguments to pass to the optimizer constructor on each worker.
        kwargs: arguments to pass to the optimizer constructor on each worker.
Shall we add an example here?
I think that a convincing example will need a way to 1) create the remote module; 2) call it; and 3) get a list of RRef params for it. We can add this after we have settled on a way of doing these.
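To make that concrete, here is one hypothetical shape those three pieces could take, using the `_remote_method` helper pattern from the RPC tutorials. `_call_method`, `_remote_method`, and `_param_rrefs` are made-up helpers, `rpc.init_rpc` is assumed to have been called on every worker, and the RRef calls used here (`local_value()` returning the value directly, the `rpc.RRef` constructor) reflect later releases rather than this PR.

```python
import torch
import torch.distributed.rpc as rpc

# 1) Create the remote module: hold an nn.Module on another worker via an RRef.
net_rref = rpc.remote("worker1", torch.nn.Linear, args=(4, 1))

# Generic helpers: run a method on the object behind an RRef, on its owner.
def _call_method(method, rref, *args, **kwargs):
    return method(rref.local_value(), *args, **kwargs)

def _remote_method(method, rref, *args, **kwargs):
    return rpc.rpc_sync(rref.owner(), _call_method,
                        args=(method, rref) + args, kwargs=kwargs)

# 2) Call it: run forward() on the module's owner.
out = _remote_method(torch.nn.Linear.forward, net_rref, torch.randn(2, 4))

# 3) Get a list of RRefs to its parameters, suitable for DistributedOptimizer.
def _param_rrefs(module):
    return [rpc.RRef(p) for p in module.parameters()]

param_rrefs = _remote_method(_param_rrefs, net_rref)
```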
torch/distributed/optim/optimizer.py
    specific parameters.

    Args:
        optimizer_class (FunctionalOptimizer): the class of optimizer to
There is no FunctionalOptimizer any more, do you mean optim.Optimizer?
mrshenli left a comment
This LGTM! My comments are mostly on docs.
            to optimize.
        args: arguments to pass to the optimizer constructor on each worker.
        kwargs: arguments to pass to the optimizer constructor on each worker.
    """
Do we want to add a warning that we only make sure concurrent step() calls will not modify the same param at the same time, but they could still modify params on different owners in an interleaved way? This means that when a dist optimizer tries to apply a grad x to a param, the param might already be different from when grad x was computed. This behavior is by design. If the application needs a globally exclusive dist optimizer step(), it will have to synchronize that on its own.
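As a strawman for the "synchronize it on their own" path, a minimal sketch that funnels every distributed step through one coordinator process. All names here (the "coordinator" worker, `exclusive_step`, the module-level `dist_optim`) are illustrative and not part of this PR.

```python
import threading

import torch.distributed.rpc as rpc

_step_lock = threading.Lock()
dist_optim = None  # a single DistributedOptimizer, set up once on the coordinator


def exclusive_step(*step_args):
    # Runs on the coordinator: serializes every distributed step, so two
    # steps can never interleave across parameter owners.
    with _step_lock:
        dist_optim.step(*step_args)

# On each trainer, instead of calling dist_optim.step(...) directly:
# rpc.rpc_sync("coordinator", exclusive_step, args=(context_id,))
```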
mrshenli left a comment
Internal test failure is real
mrshenli left a comment
Tests all passed, let's land this!
This pull request has been merged in b0cf43b.