Move RPC API to torch.distributed.rpc #27290
Conversation
test/dist_utils.py (Outdated)
rpc.init_model_parallel(self_name='worker%d' % self.rank,
                        backend=BACKEND,
                        self_rank=self.rank,
                        init_method=RPC_INIT_URL)
It's odd to have init_model_parallel under "rpc". It should be something like:

import torch.distributed.model_parallel as model_parallel
model_parallel.init_model_parallel(...)
model_parallel.rpc_sync(...)

@mrshenli What's your opinion?
It's odd to have init_model_parallel under "rpc".
I agree. We probably don't need the phrase "model parallel" here at all. @aazzolini has a point that "model parallel" does not fully cover the scope of this feature; for example, torch.distributed.rpc can support data parallel and parameter-server setups as well.
The init method here initializes contexts and sets up communication channels where necessary. After that point, the application could use it for distributed training, but might use it for other purposes as well. So the name does not have to imply what the user intends to do next.
Shall we call it torch.distributed.rpc.init_worker instead? The sync API can also be named as torch.distributed.rpc.sync_worker, but let's add sync_worker in a followup PR after we address #27096.
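For illustration, a rough sketch of how the proposed rename could look at the call site in test/dist_utils.py. The init_worker name is only the proposal under discussion, not a finalized API; BACKEND, RPC_INIT_URL, and self.rank come from the existing test code, so this snippet assumes it sits inside the test's setup method:

import torch.distributed.rpc as rpc

# Hypothetical: same arguments as today's init_model_parallel call, new name.
rpc.init_worker(self_name='worker%d' % self.rank,
                backend=BACKEND,
                self_rank=self.rank,
                init_method=RPC_INIT_URL)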
Shall we call it torch.distributed.rpc.init_worker instead? The sync API can also be named as torch.distributed.rpc.sync_worker, but let's add sync_worker in a followup PR after we address #27096.
I'm assuming the proposal here is that we will still have init_model_parallel in a different package that would call rpc.init_worker underneath?
I'm assuming the proposal here is that we will still have init_model_parallel in a different package that would call rpc.init_worker underneath?
No, I was actually proposing to replace init_model_parallel with init_worker. Say we keep init_model_parallel, and an application wants to use rpc and distributed autograd for data-parallel training; it would look weird for that application to have to call init_model_parallel.
Say we keep init_model_parallel, and an application wants to use rpc and distributed autograd for data-parallel training; it would look weird for that application to have to call init_model_parallel.
If you are using distributed autograd, doesn't that imply model parallel? Distributed autograd comes into play only if a model is split across multiple machines.
I don't think "model parallel" should even be a word anywhere in pytorch.distributed . Model parallelism is a charged term that we don't really need in order to convey what the rpc package does. RRef is a natural consequence of being able to run remote "methods" (as opposed to functions), and distributed autograd is a natural consequence of being able to pass tensors with require_grad=True over RPC. None of this needs to imply that a model even exists, here.
Model parallel only makes sense if you have a model, so it's more tied to the concept of nn.Module etc.
So I really think that init_model_parallel doesn't need to exist as a function anywheree. instead, call it rpc.init_worker() or even rpc.init().
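For illustration, a minimal sketch of the kind of usage this argues for: RPC invoking a plain function, with no model involved. The init_worker name follows the proposal above (not a finalized API), its arguments mirror the existing init_model_parallel call in test/dist_utils.py, and RPC_INIT_URL here is a placeholder rendezvous URL:

import torch
import torch.distributed.rpc as rpc

RPC_INIT_URL = 'tcp://localhost:29500'  # placeholder rendezvous URL

def add_tensors(a, b):
    # A plain function; no nn.Module or "model" anywhere.
    return a + b

# Hypothetical init name per the proposal above.
rpc.init_worker(self_name='worker0', self_rank=0, init_method=RPC_INIT_URL)

# Run the function on another worker and block for the result.
result = rpc.rpc_sync('worker1', add_tensors, args=(torch.ones(2), torch.ones(2)))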
(As a side note, it would be good to converge on a final API soon, since these changes will break some of the stuff we have already built on top of it.)
I agree -- init_model_parallel is too broad here, since we're only initializing the RPC.
Since this PR only moves things around, let's stack another PR on top that does the rename (and adds a few lines of backward compat for everything that's in flight today).
rohan-varma left a comment
LGTM!
Summary: Pull Request resolved: pytorch#27290
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D17808212
Pulled By: pietern
fbshipit-source-id: c79907940fe4888b2ceaaa1cda0078e39c89b454
Stack from ghstack:
Differential Revision: D17808212