
Conversation

@pietern pietern commented Oct 3, 2019

Stack from ghstack:

Differential Revision: D17808212

pietern added a commit that referenced this pull request Oct 3, 2019
ghstack-source-id: aacc1b5
Pull Request resolved: #27290
@pritamdamania87 pritamdamania87 self-requested a review October 3, 2019 18:43
```python
rpc.init_model_parallel(self_name='worker%d' % self.rank,
                        backend=BACKEND,
                        self_rank=self.rank,
                        init_method=RPC_INIT_URL)
```
@xush6528 xush6528 Oct 3, 2019


It's odd to have init_model_parallel under "rpc".

It should be something like:

```python
import torch.distributed.model_parallel as model_parallel

model_parallel.init_model_parallel(...)
model_parallel.rpc_sync(...)
```

@mrshenli What's your opinion?

@mrshenli mrshenli Oct 4, 2019


> It's odd to have init_model_parallel under "rpc".

I agree. Probably we don't need to use the phrase "model parallel" here. @aazzolini has a point that "model parallel" does not fully cover the scope of this feature. For example, torch.distributed.rpc can support data parallel and parameter server setups as well.

The init method here initializes contexts and sets up communication channels if necessary. After this point, the application could use it for distributed training, and might use it for other purposes as well. So, maybe, the name does not have to imply what the user would like to do next.

Shall we call it torch.distributed.rpc.init_worker instead? The sync API could then be named torch.distributed.rpc.sync_worker, but let's add sync_worker in a follow-up PR after we address #27096.
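
For illustration only (init_worker is just the proposed name at this point, not an existing API), the setup from the snippet at the top of this thread would then read something like:

```python
import torch.distributed.rpc as rpc

# Hypothetical rename: same arguments as the current init_model_parallel
# call above, only the function name changes under this proposal.
rpc.init_worker(self_name='worker%d' % self.rank,
                backend=BACKEND,
                self_rank=self.rank,
                init_method=RPC_INIT_URL)
```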

Contributor

> Shall we call it torch.distributed.rpc.init_worker instead? The sync API could then be named torch.distributed.rpc.sync_worker, but let's add sync_worker in a follow-up PR after we address #27096.

I'm assuming the proposal here is that we will still have init_model_parallel in a different package that would call rpc.init_worker underneath?

Contributor

> I'm assuming the proposal here is that we will still have init_model_parallel in a different package that would call rpc.init_worker underneath?

No, I was actually proposing to replace init_model_parallel with init_worker. Say we provide init_model_parallel, and an application would like to use rpc and distributed autograd to do data parallel training; it would look weird that the application has to call init_model_parallel.

Contributor

> Say we provide init_model_parallel, and an application would like to use rpc and distributed autograd to do data parallel training; it would look weird that the application has to call init_model_parallel.

If you are using distributed autograd, doesn't that imply model parallel? Distributed autograd comes into play only if a model is split across multiple machines.

Contributor

I don't think "model parallel" should even be a word anywhere in torch.distributed. Model parallelism is a charged term that we don't really need in order to convey what the rpc package does. RRef is a natural consequence of being able to run remote "methods" (as opposed to functions), and distributed autograd is a natural consequence of being able to pass tensors with requires_grad=True over RPC. None of this needs to imply that a model even exists here.

Model parallel only makes sense if you have a model, so it's more tied to the concept of nn.Module etc.

So I really think that init_model_parallel doesn't need to exist as a function anywhere. Instead, call it rpc.init_worker() or even rpc.init().
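
Roughly the layering I have in mind (a sketch only; these names are illustrative and not final):

```python
import torch
import torch.distributed.rpc as rpc

# Generic worker setup: nothing here implies that a model exists.
rpc.init_worker(self_name='worker%d' % self.rank,
                backend=BACKEND,
                self_rank=self.rank,
                init_method=RPC_INIT_URL)

# RPC primitives stay model-agnostic; running a plain function on another
# worker needs no notion of "model parallel".
ret = rpc.rpc_sync('worker1', torch.add,
                   args=(torch.ones(2), torch.ones(2)))
```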

Contributor

(As a side note, at some point soon it would be better if we converged on a final API, since these changes will actually break some of the stuff we have built on top of it already.)

@pietern pietern Oct 8, 2019


I agree -- init_model_parallel is too broad here, since we're only initializing the RPC layer.

Since this PR only moves things around, let's stack another PR on top that does the rename (and adds a few lines of backward compatibility for everything that's in flight today).

@rohan-varma rohan-varma self-requested a review October 8, 2019 17:30
@rohan-varma rohan-varma left a comment


LGTM!

@facebook-github-bot
@pietern merged this pull request in b4ce922.

@facebook-github-bot facebook-github-bot deleted the gh/pietern/45/head branch October 28, 2019 22:18
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Summary: Pull Request resolved: pytorch#27290

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D17808212

Pulled By: pietern

fbshipit-source-id: c79907940fe4888b2ceaaa1cda0078e39c89b454
Labels: Merged, oncall: distributed
