
Conversation

@pietern pietern commented Oct 3, 2019

Stack from ghstack:

Differential Revision: D17808212

pietern added a commit that referenced this pull request Oct 3, 2019
ghstack-source-id: aacc1b5
Pull Request resolved: #27290
@pritamdamania87 pritamdamania87 self-requested a review October 3, 2019 18:43
```python
rpc.init_model_parallel(self_name='worker%d' % self.rank,
                        backend=BACKEND,
                        self_rank=self.rank,
                        init_method=RPC_INIT_URL)
```
@xush6528 xush6528 Oct 3, 2019


It's odd to have init_model_parallel under "rpc".

It should be something like:

```python
import torch.distributed.model_parallel as model_parallel

model_parallel.init_model_parallel(...)
model_parallel.rpc_sync(...)
```

@mrshenli What's your opinion?

@mrshenli mrshenli Oct 4, 2019


> It's odd to have init_model_parallel under "rpc".

I agree. Probably we don't need to use the phrase "model parallel" here. @aazzolini has a point that "model parallel" does not fully cover the scope of this feature. For example, torch.distributed.rpc can support data parallel and parameter server setups as well.

The init method here initializes contexts and sets up communication channels if necessary. After this point, the application could use it for distributed training, and might use it for other purposes as well. So, maybe, the name does not have to imply what the user would like to do next.

Shall we call it torch.distributed.rpc.init_worker instead? The sync API could then be named torch.distributed.rpc.sync_worker, but let's add sync_worker in a follow-up PR after we address #27096.
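
For illustration only (init_worker is just the proposed name at this point, not an existing API), the setup from the snippet at the top of this thread would then read something like:

```python
import torch.distributed.rpc as rpc

# Hypothetical rename: same arguments as the current init_model_parallel
# call above, only the function name changes under this proposal.
rpc.init_worker(self_name='worker%d' % self.rank,
                backend=BACKEND,
                self_rank=self.rank,
                init_method=RPC_INIT_URL)
```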

Contributor

> Shall we call it torch.distributed.rpc.init_worker instead? The sync API could then be named torch.distributed.rpc.sync_worker, but let's add sync_worker in a follow-up PR after we address #27096.

I'm assuming the proposal here is that we will still have init_model_parallel in a different package that would call rpc.init_worker underneath?

Contributor

> I'm assuming the proposal here is that we will still have init_model_parallel in a different package that would call rpc.init_worker underneath?

No, I was actually proposing to replace init_model_parallel with init_worker. Say we provide init_model_parallel, and an application would like to use rpc and distributed autograd to do data parallel training; it would look weird that the application has to call init_model_parallel.

Contributor

> Say we provide init_model_parallel, and an application would like to use rpc and distributed autograd to do data parallel training; it would look weird that the application has to call init_model_parallel.

If you are using distributed autograd, doesn't that imply model parallel? Distributed autograd comes into play only if a model is split across multiple machines.

Contributor

I don't think "model parallel" should even be a word anywhere in torch.distributed. Model parallelism is a charged term that we don't really need in order to convey what the rpc package does. RRef is a natural consequence of being able to run remote "methods" (as opposed to functions), and distributed autograd is a natural consequence of being able to pass tensors with requires_grad=True over RPC. None of this needs to imply that a model even exists here.

Model parallel only makes sense if you have a model, so it's more tied to the concept of nn.Module etc.

So I really think that init_model_parallel doesn't need to exist as a function anywhere. Instead, call it rpc.init_worker() or even rpc.init().
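
Roughly the layering I have in mind (a sketch only; these names are illustrative and not final):

```python
import torch
import torch.distributed.rpc as rpc

# Generic worker setup: nothing here implies that a model exists.
rpc.init_worker(self_name='worker%d' % self.rank,
                backend=BACKEND,
                self_rank=self.rank,
                init_method=RPC_INIT_URL)

# RPC primitives stay model-agnostic; running a plain function on another
# worker needs no notion of "model parallel".
ret = rpc.rpc_sync('worker1', torch.add,
                   args=(torch.ones(2), torch.ones(2)))
```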

Contributor

(As a side note, at some point soon it would be better if we converged on a final API, since these changes will actually break some of the stuff we have built on top of it already.)

@pietern pietern Oct 8, 2019


I agree -- init_model_parallel is too broad here, since we're only initializing the RPC layer.

Since this PR only moves things around, let's stack another PR on top that does the rename (and adds a few lines of backward compatibility for everything that's in flight today).

@rohan-varma rohan-varma self-requested a review October 8, 2019 17:30
@rohan-varma rohan-varma left a comment


LGTM!

@facebook-github-bot
@pietern merged this pull request in b4ce922.

@facebook-github-bot facebook-github-bot deleted the gh/pietern/45/head branch October 28, 2019 22:18
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Summary: Pull Request resolved: pytorch#27290

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D17808212

Pulled By: pietern

fbshipit-source-id: c79907940fe4888b2ceaaa1cda0078e39c89b454
Labels: Merged, oncall: distributed
