Skip to content

Conversation

@pritamdamania87
Copy link
Contributor

@pritamdamania87 pritamdamania87 commented Nov 1, 2019

Stack from ghstack:

Distributed autograd was initialized after RPC and this would cause a
race in some scenarios where one node might have initialized distributed
autograd, calls backward() but other nodes have not initialized distributed
autograd yet.

Moving this before _init_rpc fixes the problem since _init_rpc implicitly
has a sync between processes via the store.

Differential Revision: D18280875

Distributed autograd was initialized after RPC and this would cause a
race in some scenarios where one node might have initialized distributed
autograd, calls backward() but other nodes have not initialized distributed
autograd yet.

Moving this before `_init_rpc` fixes the problem since `_init_rpc` implicitly
has a sync between processes via the store.

Differential Revision: [D18280875](https://our.internmc.facebook.com/intern/diff/D18280875/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this pull request Nov 1, 2019
Distributed autograd was initialized after RPC and this would cause a
race in some scenarios where one node might have initialized distributed
autograd, calls backward() but other nodes have not initialized distributed
autograd yet.

Moving this before `_init_rpc` fixes the problem since `_init_rpc` implicitly
has a sync between processes via the store.

Differential Revision: [D18280875](https://our.internmc.facebook.com/intern/diff/D18280875/)

ghstack-source-id: 93117986
Pull Request resolved: #29069
Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

store, _, _ = next(rendezvous_iterator)

# Initialize autograd before RPC since _init_rpc guarantees all
# processes sync via the store. If we initialize autograd after rpc,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RPC and rpc

Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is failing.

The RPC initialization tests may crash because you now end up initializing distributed autograd twice.

@mrshenli
Copy link
Contributor

mrshenli commented Nov 5, 2019

This would close #29156, #29117 and #29212

@mrshenli
Copy link
Contributor

mrshenli commented Nov 5, 2019

If #29157 is merged before this one, could you enable those two tests?

@rohan-varma
Copy link
Contributor

As @mrshenli mentioned, I went ahead and disabled some of these tests in #29157 while we fix them.

@zhaojuanmao
Copy link
Contributor

it will close #28932 as well

@mrshenli
Copy link
Contributor

mrshenli commented Nov 7, 2019

Shall we get this landed soon?

@mrshenli
Copy link
Contributor

mrshenli commented Nov 7, 2019

This will also close #29410

Distributed autograd was initialized after RPC and this would cause a
race in some scenarios where one node might have initialized distributed
autograd, calls backward() but other nodes have not initialized distributed
autograd yet.

Moving this before `_init_rpc` fixes the problem since `_init_rpc` implicitly
has a sync between processes via the store.

Differential Revision: [D18280875](https://our.internmc.facebook.com/intern/diff/D18280875/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this pull request Nov 8, 2019
Pull Request resolved: #29069

Distributed autograd was initialized after RPC and this would cause a
race in some scenarios where one node might have initialized distributed
autograd, calls backward() but other nodes have not initialized distributed
autograd yet.

Moving this before `_init_rpc` fixes the problem since `_init_rpc` implicitly
has a sync between processes via the store.
ghstack-source-id: 93505965

Differential Revision: [D18280875](https://our.internmc.facebook.com/intern/diff/D18280875/)
Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. Just had a question regarding init_rpc.

with self.assertRaisesRegex(RuntimeError, "is not unique"):
rpc.init_model_parallel(
self_name="duplicate_name",
store, _, _ = next(torch.distributed.rendezvous(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why calling the lower level APIs instead of init_model_parallel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current changes init_model_parallel throws a KeyError because the worker_name_to_id[self_name] lookup fails for invalid strings. I call the lower level API here to reproduce the errors we're looking for.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we will need to update these tests after we move worker_name_to_id to backend_config dict.

Distributed autograd was initialized after RPC and this would cause a
race in some scenarios where one node might have initialized distributed
autograd, calls backward() but other nodes have not initialized distributed
autograd yet.

Moving this before `_init_rpc` fixes the problem since `_init_rpc` implicitly
has a sync between processes via the store.

Differential Revision: [D18280875](https://our.internmc.facebook.com/intern/diff/D18280875/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this pull request Nov 8, 2019
Pull Request resolved: #29069

Distributed autograd was initialized after RPC and this would cause a
race in some scenarios where one node might have initialized distributed
autograd, calls backward() but other nodes have not initialized distributed
autograd yet.

Moving this before `_init_rpc` fixes the problem since `_init_rpc` implicitly
has a sync between processes via the store.
ghstack-source-id: 93535922

Differential Revision: [D18280875](https://our.internmc.facebook.com/intern/diff/D18280875/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 5e1983f.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants