Fix distributed autograd initialization. #29069
Conversation
Distributed autograd was initialized after RPC, which could cause a race in some scenarios: one node might have initialized distributed autograd and called backward() while other nodes had not initialized distributed autograd yet. Moving the initialization before `_init_rpc` fixes the problem, since `_init_rpc` implicitly syncs all processes via the store.

Differential Revision: [D18280875](https://our.internmc.facebook.com/intern/diff/D18280875/)

Pull Request resolved: #29069
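For orientation, here is a minimal runnable sketch of the reordering the commit message describes. Only `_init_rpc`, the store handed back by rendezvous, and `worker_name_to_id` come from this PR; the stub bodies and the `_init_autograd` name are illustrative assumptions, not the actual PyTorch internals.

```python
# Hedged sketch of the ordering fix; stubs stand in for PyTorch internals.

def _init_autograd(worker_id):
    # Stand-in for setting up the distributed autograd container on this worker.
    print(f"distributed autograd ready for worker {worker_id}")

def _init_rpc(store, self_name):
    # Stand-in for the real _init_rpc. The property the PR relies on: it syncs
    # all processes via the store, so returning from it implies every worker
    # has already executed the lines above it.
    print(f"rpc agent up for {self_name} (store={store!r})")

def init_model_parallel(self_name, store, worker_name_to_id):
    # Before this PR, _init_rpc ran first and autograd was initialized after it,
    # so a fast worker could call backward() against peers whose distributed
    # autograd was not set up yet. Initializing autograd first closes the race,
    # because _init_rpc acts as an implicit barrier.
    _init_autograd(worker_name_to_id[self_name])
    _init_rpc(store, self_name)

if __name__ == "__main__":
    init_model_parallel("worker0", store="tcp_store", worker_name_to_id={"worker0": 0})
```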
pietern
left a comment
LGTM
torch/distributed/rpc/__init__.py
Outdated
store, _, _ = next(rendezvous_iterator)

# Initialize autograd before RPC since _init_rpc guarantees all
# processes sync via the store. If we initialize autograd after rpc,
Nit: the new comment mixes "RPC" and "rpc".
pietern
left a comment
CI is failing.
The RPC initialization tests may crash because you now end up initializing distributed autograd twice.
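To make that failure mode concrete, here is a hedged, self-contained sketch; the container class and its error message are assumptions, only the "initialized twice" mechanism comes from the comment above.

```python
# Hypothetical stand-in for the distributed autograd container, illustrating
# why a test that runs the init path twice would now fail.
class FakeAutogradContainer:
    _initialized = False

    @classmethod
    def init(cls, worker_id):
        if cls._initialized:
            raise RuntimeError("distributed autograd already initialized")
        cls._initialized = True

FakeAutogradContainer.init(0)        # first init_model_parallel call: fine
try:
    FakeAutogradContainer.init(0)    # a re-init test hits the setup a second time
except RuntimeError as err:
    print(err)                       # "distributed autograd already initialized"
```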
If #29157 is merged before this one, could you enable those two tests?

It will close #28932 as well.

Shall we get this landed soon?

This will also close #29410.
mrshenli
left a comment
This LGTM. Just had a question regarding init_rpc.
with self.assertRaisesRegex(RuntimeError, "is not unique"):
    rpc.init_model_parallel(
        self_name="duplicate_name",

store, _, _ = next(torch.distributed.rendezvous(
Why call the lower-level APIs instead of `init_model_parallel`?
With the current changes `init_model_parallel` throws a KeyError, because the `worker_name_to_id[self_name]` lookup fails for invalid strings. I call the lower-level API here to reproduce the errors we're looking for.
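A hedged illustration of the distinction being discussed: the invalid name comes from the test snippet above, but the dictionary contents and the lower-level call shape at the end are assumptions, not the exact test or `_init_rpc` signature.

```python
# Why the test bypasses init_model_parallel: with autograd now initialized
# inside it via worker_name_to_id[self_name], an intentionally bad name fails
# the dict lookup with KeyError before RPC ever gets to validate the name.
worker_name_to_id = {"worker0": 0, "worker1": 1}  # assumed contents

try:
    worker_name_to_id["duplicate_name"]   # name the test passes on purpose
except KeyError:
    print("KeyError, not the RuntimeError the test asserts on")

# Hence the test rendezvouses itself and drives the lower-level entry point,
# roughly (assumed call shape, not the exact signature):
#   store, _, _ = next(torch.distributed.rendezvous(init_method, rank, world_size))
#   _init_rpc(store=store, self_name="duplicate_name", ...)
```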
I see, we will need to update these tests after we move `worker_name_to_id` to the `backend_config` dict.
This pull request has been merged in 5e1983f.
Stack from ghstack:
Distributed autograd was initialized after RPC, which could cause a race in some scenarios: one node might have initialized distributed autograd and called backward() while other nodes had not initialized distributed autograd yet.

Moving this before `_init_rpc` fixes the problem, since `_init_rpc` implicitly has a sync between processes via the store.
Differential Revision: D18280875