Fix distributed autograd initialization. #29069
Conversation
Distributed autograd was initialized after RPC, which could cause a race in some scenarios: one node might have initialized distributed autograd and called backward() while other nodes had not initialized distributed autograd yet. Moving the initialization before `_init_rpc` fixes the problem, since `_init_rpc` implicitly syncs all processes via the store.

Differential Revision: [D18280875](https://our.internmc.facebook.com/intern/diff/D18280875/)

Pull Request resolved: #29069
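For orientation, here is a minimal runnable sketch of the reordering the commit message describes. Only `_init_rpc`, the store handed back by rendezvous, and `worker_name_to_id` come from this PR; the stub bodies and the `_init_autograd` name are illustrative assumptions, not the actual PyTorch internals.

```python
# Hedged sketch of the ordering fix; stubs stand in for PyTorch internals.

def _init_autograd(worker_id):
    # Stand-in for setting up the distributed autograd container on this worker.
    print(f"distributed autograd ready for worker {worker_id}")

def _init_rpc(store, self_name):
    # Stand-in for the real _init_rpc. The property the PR relies on: it syncs
    # all processes via the store, so returning from it implies every worker
    # has already executed the lines above it.
    print(f"rpc agent up for {self_name} (store={store!r})")

def init_model_parallel(self_name, store, worker_name_to_id):
    # Before this PR, _init_rpc ran first and autograd was initialized after it,
    # so a fast worker could call backward() against peers whose distributed
    # autograd was not set up yet. Initializing autograd first closes the race,
    # because _init_rpc acts as an implicit barrier.
    _init_autograd(worker_name_to_id[self_name])
    _init_rpc(store, self_name)

if __name__ == "__main__":
    init_model_parallel("worker0", store="tcp_store", worker_name_to_id={"worker0": 0})
```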
pietern
left a comment
LGTM
torch/distributed/rpc/__init__.py
Outdated
store, _, _ = next(rendezvous_iterator)

# Initialize autograd before RPC since _init_rpc guarantees all
# processes sync via the store. If we initialize autograd after rpc,
Nit: the new comment mixes "RPC" and "rpc".
pietern
left a comment
CI is failing.
The RPC initialization tests may crash because you now end up initializing distributed autograd twice.
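To make that failure mode concrete, here is a hedged, self-contained sketch; the container class and its error message are assumptions, only the "initialized twice" mechanism comes from the comment above.

```python
# Hypothetical stand-in for the distributed autograd container, illustrating
# why a test that runs the init path twice would now fail.
class FakeAutogradContainer:
    _initialized = False

    @classmethod
    def init(cls, worker_id):
        if cls._initialized:
            raise RuntimeError("distributed autograd already initialized")
        cls._initialized = True

FakeAutogradContainer.init(0)        # first init_model_parallel call: fine
try:
    FakeAutogradContainer.init(0)    # a re-init test hits the setup a second time
except RuntimeError as err:
    print(err)                       # "distributed autograd already initialized"
```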
If #29157 is merged before this one, could you enable those two tests?

It will close #28932 as well.

Shall we get this landed soon?

This will also close #29410.
mrshenli
left a comment
This LGTM. Just had a question regarding init_rpc.
with self.assertRaisesRegex(RuntimeError, "is not unique"):
    rpc.init_model_parallel(
        self_name="duplicate_name",

store, _, _ = next(torch.distributed.rendezvous(
Why call the lower-level APIs instead of `init_model_parallel`?
With the current changes `init_model_parallel` throws a KeyError, because the `worker_name_to_id[self_name]` lookup fails for invalid strings. I call the lower-level API here to reproduce the errors we're looking for.
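A hedged illustration of the distinction being discussed: the invalid name comes from the test snippet above, but the dictionary contents and the lower-level call shape at the end are assumptions, not the exact test or `_init_rpc` signature.

```python
# Why the test bypasses init_model_parallel: with autograd now initialized
# inside it via worker_name_to_id[self_name], an intentionally bad name fails
# the dict lookup with KeyError before RPC ever gets to validate the name.
worker_name_to_id = {"worker0": 0, "worker1": 1}  # assumed contents

try:
    worker_name_to_id["duplicate_name"]   # name the test passes on purpose
except KeyError:
    print("KeyError, not the RuntimeError the test asserts on")

# Hence the test rendezvouses itself and drives the lower-level entry point,
# roughly (assumed call shape, not the exact signature):
#   store, _, _ = next(torch.distributed.rendezvous(init_method, rank, world_size))
#   _init_rpc(store=store, self_name="duplicate_name", ...)
```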
I see, we will need to update these tests after we move `worker_name_to_id` to the `backend_config` dict.
This pull request has been merged in 5e1983f.
Stack from ghstack:
Distributed autograd was initialized after RPC, which could cause a race in some scenarios: one node might have initialized distributed autograd and called backward() while other nodes had not initialized distributed autograd yet.

Moving this before `_init_rpc` fixes the problem, since `_init_rpc` implicitly has a sync between processes via the store.
Differential Revision: D18280875