Conversation

@zhaojuanmao
Contributor

Summary:
This diff adds support for Python user-defined functions over RPC, for #23110. The workflow is:

  1. pickle the Python UDF
  2. pass the pickle to C++
  3. C++ passes it over RPC from client to server
  4. the server calls the runPythonUDF() Python function to unpickle and run the Python UDF, then pickles the UDF result using the Python embedder
  5. the serialized result is passed back from server to client
  6. the client calls the loadPythonUDFResult() Python function to unpickle the result
  7. the result is returned to Python

For now, rpc_sync_builtin() and rpc_async_builtin() are temporary interfaces for builtin operator remote calls; they accept a qualified name string, and this interface can execute builtin operators in C++ land.

rpc_sync() and rpc_async() accept Python callables only right now; these can be user-defined Python functions or builtin operator Python functions, and they will be executed in Python land.

Once we can resolve builtin operator Python callables to qualified name strings, we can merge rpc_sync_builtin() into rpc_sync().
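
The round trip above can be sketched in plain Python. This is only a minimal sketch of the pickling flow; `remote_add` is an illustrative name, and none of this is the actual torch.distributed implementation:

```python
import pickle

def remote_add(x, y):
    # a user-defined function the client wants the server to run
    return x + y

# client side: pickle the python udf together with its arguments (step 1)
request = pickle.dumps((remote_add, (2, 3), {}))

# server side: unpickle, run the udf, and pickle the result (steps 4-5)
func, args, kwargs = pickle.loads(request)
response = pickle.dumps(func(*args, **kwargs))

# client side: unpickle the serialized result (steps 6-7)
result = pickle.loads(response)
print(result)  # -> 5
```

Note that pickle serializes the function by reference (module + qualified name), so the callee must be able to import the same module; that is why the server-side handler runs in Python land.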

Differential Revision: D16390764

@pytorchbot pytorchbot added caffe2 module: build Build system issues oncall: distributed Add this issue/PR to distributed oncall triage queue module: internals Related to internal abstractions in c10 and ATen module: tests Issues related to tests (not the torch.testing module) labels Jul 30, 2019
Contributor

We'd better assert here:

assert _agent is not None, "init_rpc(..) has not been called to setup rpc_agent yet."

Otherwise, the error message could look like:

Traceback (most recent call last):
  File "/usr/local/fbcode/platform007/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/fbcode/platform007/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/data/users/shihaoxu/fbsource/fbcode/buck-out/dev/gen/caffe2/torch/fb/modelparallel/prototype/pytorch/tests/test_rpc#binary,link-tree/caffe2/torch/fb/modelparallel/prototype/pytorch/rpc.py", line 1057, in _peer_ping
    _low_level_rpc_no_result(_rpc_ping, inputs=[], worker_id=i)
  File "/data/users/shihaoxu/fbsource/fbcode/buck-out/dev/gen/caffe2/torch/fb/modelparallel/prototype/pytorch/tests/test_rpc#binary,link-tree/caffe2/torch/fb/modelparallel/prototype/pytorch/rpc.py", line 484, in _low_level_rpc_no_result
    _call_rpc(RPCRequest(func, inputs, kwargs), worker_id)
  File "/data/users/shihaoxu/fbsource/fbcode/buck-out/dev/gen/caffe2/torch/fb/modelparallel/prototype/pytorch/tests/test_rpc#binary,link-tree/caffe2/torch/fb/modelparallel/prototype/pytorch/rpc.py", line 472, in _call_rpc
    dst_name=WorkerCtx.instance().name_for_id(worker_id),
  File "/data/users/shihaoxu/fbsource/fbcode/buck-out/dev/gen/caffe2/torch/fb/modelparallel/prototype/pytorch/tests/test_rpc#binary,link-tree/caffe2/torch/fb/modelparallel/prototype/pytorch/comm.py", line 131, in send_callable
    dist.rpc_async(dst_name, _run_request, obj)
  File "/data/users/shihaoxu/fbsource/fbcode/buck-out/dev/gen/caffe2/torch/fb/modelparallel/prototype/pytorch/tests/test_rpc#binary,link-tree/torch/distributed/rpc.py", line 95, in rpc_async
    return invoke_rpc_python_udf(_agent, to, serialize(PythonUDF(func, args, kwargs)))
TypeError: invoke_rpc_python_udf(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.distributed.RpcAgent, arg1: str, arg2: str) -> torch.distributed.FutureMessage

Invoked with: None, 'w:0', b'\x80\x03ctorch.distributed.internal_rpc_utils\nPythonUDF\nq\x00ccaffe2.torch.fb.modelparallel.prototype.pytorch.comm\n_run_request\nq\x01ccaffe2.torch.fb.modelparallel.prototype.pytorch.rpc\nRPCRequest\nq\x02)\x81q\x03}q\x04(X\x04\x00\x00\x00funcq\x05ccaffe2.torch.fb.modelparallel.prototype.pytorch.rpc\n_rpc_ping\nq\x06X\x06\x00\x00\x00callerq\x07X\x03\x00\x00\x00w:1q\x08X\x06\x00\x00\x00outputq\tNX\x06\x00\x00\x00inputsq\n]q\x0bX\x06\x00\x00\x00kwargsq\x0cNX\x0c\x00\x00\x00client_scopeq\rccaffe2.torch.fb.modelparallel.prototype.pytorch.rpc\nRRefId\nq\x0eK\x01K\x00\x86q\x0f\x81q\x10ub}q\x11\x87q\x12\x81q\x13.'
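
A minimal sketch of the suggested guard, using the `_agent`/`init_rpc` names from this thread (the body of `rpc_async` is elided and purely illustrative):

```python
_agent = None  # set by init_rpc(...); None until then

def rpc_async(to, func, *args, **kwargs):
    # fail fast with a readable message instead of the pybind11 TypeError
    assert _agent is not None, (
        "init_rpc(..) has not been called to setup rpc_agent yet.")
    # ... would call invoke_rpc_python_udf(_agent, to, ...) here

try:
    rpc_async("w:0", print)
except AssertionError as err:
    caught = str(err)

print(caught)
```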

Contributor Author

In which case is _agent None? Maybe we should fix that first.

Contributor

This has been renamed.

MessageType::PYTHON_UDF_OP -> MessageType::PYTHON_UDF_OP

Contributor

MessageType::PYTHON_UDF_RET -> MessageType::PYTHON_RET

Contributor

I think we need to discuss this line more.

It looks to me like our user starts by launching a Python interpreter (a C program) to interpret a piece of Python code. This Python interpreter loads a pybind11 module, torch.distributed, that bundles several C++ classes.
We use this special binding module, backed by C++ classes, to instantiate an RpcAgent C++ instance and assign it to a variable torch.distributed.rpc._agent.
Meanwhile, in the constructor of RpcAgent, PythonRpcHandler::init() is called and creates a second Python interpreter?? (If I understand the doc correctly.)
This second Python interpreter imports the module torch.distributed.internal_rpc_utils and uses the utilities in it.
Although it can successfully run the function sent by the client and return correct results back to the client, it's impossible for the client to change the global state in the server, which is held by the first Python interpreter.


There is a requirement in our RRef prototype implementation that we need to allow an RPC client (an RRef user) to dictate that the RPC server (RRef owner) change its global state.
The state is stored in a module's global variable, which lives in the original Python interpreter.

For example,

  1. A client could ask the server to fetch an RRef instance that matches the ref_id in the message, from the server's global RRef registry.
  2. A client could send a termination signal message to a server; the server, on receiving the signal, should set its status to termination_ongoing=True, so that it will delete the remote references it holds and notify the owners about reducing ref counts before it goes away.
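
To illustrate the requirement (purely a sketch; `_rref_registry` and `termination_ongoing` are made-up names for this discussion): both operations below only work if the RPC-delivered callable runs in the same interpreter that owns the module-level globals.

```python
# module-level state living in the original interpreter
_rref_registry = {"rref_1": "some-owned-value"}
termination_ongoing = False

def fetch_rref(ref_id):
    # case 1: a client asks the owner for an RRef by id
    return _rref_registry[ref_id]

def handle_termination_signal():
    # case 2: a client tells the server to flip its termination flag
    global termination_ongoing
    termination_ongoing = True

# simulate the server processing two incoming requests
fetched = fetch_rref("rref_1")
handle_termination_signal()
print(fetched, termination_ongoing)  # -> some-owned-value True
```

If the UDF instead ran in a second interpreter, it would see a separate copy of this module, and the mutations would be invisible to the first interpreter.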

I read this doc: https://docs.python.org/3/extending/index.html#extending-index

It looks to me like there are 2 ways for C/C++ to interact with Python:

  1. Write modules in C or C++ to extend the Python interpreter with new modules.

  2. Sometimes, rather than creating an extension that runs inside the Python interpreter as the main application, it is desirable to instead embed the CPython runtime inside a larger application.

Py_Initialize() looks to me like it belongs to the interface for the second case.

Contributor Author

I thought we only call Py_Initialize() once to import the interpreter, as init() is only called once...? Although we could enforce calling it once as well, I do not understand why it would be called twice.

Also, could you please share what kind of error you encountered during the integration? We can sync up offline.

Contributor
@xush6528 xush6528 Aug 6, 2019

@zhaojuanmao
NVM. There is a separate call for creating a new interpreter:
https://docs.python.org/3/c-api/init.html#c.Py_NewInterpreter

So I think Py_Initialize ensures a singleton and initializes only once.

Contributor Author

I added a flag to ensure we call Py_Finalize() only if Py_Initialize() was called in the embedder.

Contributor

Can we move this into the destructor of ProcessGroupAgent?

Contributor

More generally, the lifetime of PythonRpcHandler's "static" variables actually seems directly tied to ProcessGroupAgent lifetime. If this is true, can we just make PythonRpcHandler a member of ProcessGroupAgent?

Contributor Author

That was the initial thought, but it turns out we need to call an interface of PythonRpcHandler when we get the result from a future. At that point, the future does not have a processGroupAgent.

Contributor
@xush6528 xush6528 Aug 6, 2019

We could remove this line, Py_Initialize();.

As the documentation says,

This initializes the table of loaded modules (sys.modules), and creates the fundamental modules builtins, __main__ and sys. It also initializes the module search path (sys.path). It does not set sys.argv;

Since the code has run to here, the interpreter's system-level modules must already have been initialized, so there is no need to call Py_Initialize() again.

Contributor Author

We still need this if the application is a C++ application with no Python environment. I added a check for Py_IsInitialized() first, to ensure Py_Initialize() is called only once.

test/test_rpc.py Outdated
Contributor
@xush6528 xush6528 Aug 6, 2019

This could be flaky: in the worst case, a worker could have sent out its request but not yet have received one. In that case, the global var it owns is not changed.

This is the code to simulate the worst case.

        dstRank = n % self.world_size
        import time; time.sleep(self.rank)
        ret = dist.rpc_sync('worker%d' % dstRank, modify_global_var, args=(True,))
        self.assertEqual(global_var, True)

Fix:

        dstRank = n % self.world_size
        ret = dist.rpc_sync('worker%d' % dstRank, modify_global_var, args=(True,))
        dist.barrier()  # After this barrier, we know that rank[i] has processed the request to modify the global var from rank[i - 1]. 
        self.assertEqual(global_var, True)


Contributor Author

good catch!

Contributor

Thinking more about this, it seems neither dist.barrier() nor the existing sync_rpc can guarantee a fix for this problem, because they only enforce that all sends are done; there is no guarantee that all received messages are processed.

This can be fixed when we switch to a ThreadPool for recv and let sync_rpc block until both send and recv are done. But the question is what we should wait for if received messages trigger subsequent sends (and those sends can trigger additional sends).

I am thinking about just doing the following:

  1. Block the send queue, and wait till all sends are done.
  2. Block the recv queue, and wait till all recvs are done.

This means that we at least wait until existing sends are done and they are processed on the callee side (sufficient for this test), and we are not going to wait for additional sends/recvs triggered by the existing sends. Does this make sense? @xush6528 @zhaojuanmao

(this is not a change request for this PR, just discussion)
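
The two-step wait could look roughly like this, with standard-library queues standing in for the agent's send/recv machinery (illustrative only; the real ProcessGroupAgent internals differ):

```python
import queue
import threading

send_q, recv_q = queue.Queue(), queue.Queue()
processed = []

def sender():
    while True:
        msg = send_q.get()
        recv_q.put(msg)        # "deliver" the message to the callee
        send_q.task_done()

def receiver():
    while True:
        msg = recv_q.get()
        processed.append(msg)  # callee-side processing of the request
        recv_q.task_done()

threading.Thread(target=sender, daemon=True).start()
threading.Thread(target=receiver, daemon=True).start()

for i in range(3):
    send_q.put(i)

send_q.join()  # 1. wait till all pending sends are delivered
recv_q.join()  # 2. wait till all delivered messages are processed
print(processed)  # -> [0, 1, 2]
```

Because each message is put on the recv queue before its send is marked done, the second join() cannot return before every already-sent message has been processed; messages sent after the joins are simply not waited for, which matches the "existing sends only" semantics above.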

Contributor

In terms of changes, if we all agree that dist.barrier does not guarantee a fix for the problem, shall we disable the test for now?

Contributor Author

Sounds good, let me disable the test for now.

Contributor

Under what circumstances will we be sending Python UDFs to remote processes that aren't already running Python? Does it simplify matters if you can assume that Python workers always start up from Python?

Contributor
@ezyang ezyang left a comment

Beyond some of the smaller requests, at a larger level I'd like some clarity on the lifetime of the PythonRpcHandler object, and on the relationship between C++-only execution versus mixed C++ and Python execution.

Sorry about the late review; the release was this week and I kept putting this PR review off XD

@zhaojuanmao
Contributor Author

Thanks for your comments @ezyang!

  1. For PythonRpcHandler, we do not want it to have the same lifetime as processGroupAgent, because some interfaces like loadPythonUDFResult() need to be called without a processGroupAgent. Ideally, all interfaces of PythonRpcHandler except init() should be callable from anywhere. I think it makes sense to make it a singleton class, or just a namespace plus global functions/variables. What do you think?

  2. "the relationship between C++ only execution versus C++ and Python execution." -- would you please clarify this more? I do not quite understand it :(.

A builtin op can be executed remotely purely in C++, because JIT supports that.
For a pure Python function, we have to pickle it, unpickle it in Python land, and execute it in Python land as well. That is why C++ and Python execution are mixed.
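
The two execution paths might be sketched like this (hypothetical names; the real dispatch lives in C++ and uses JIT operator resolution rather than a dict):

```python
import pickle

# stand-in for the C++ builtin-operator table resolved via qualified names
BUILTIN_OPS = {"aten::add": lambda a, b: a + b}

def dispatch(kind, payload):
    if kind == "builtin":
        # C++ land: resolve the qualified name string, no Python needed
        qualified_name, args = payload
        return BUILTIN_OPS[qualified_name](*args)
    else:
        # Python land: unpickle the callable and run it in the interpreter
        func, args = pickle.loads(payload)
        return func(*args)

builtin_result = dispatch("builtin", ("aten::add", (1, 2)))
python_result = dispatch("python", pickle.dumps((max, (4, 7))))
print(builtin_result, python_result)  # -> 3 7
```

Once builtin-operator Python callables can be resolved to their qualified name strings, the first path can absorb them and the two entry points can merge, as the summary suggests.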

@gqchen gqchen self-requested a review August 9, 2019 22:48
@zhaojuanmao zhaojuanmao requested a review from ezyang August 9, 2019 22:54
@zhaojuanmao
Contributor Author

Thanks everyone for giving such great feedback!

I resolved most of the comments, including switching the CPython API to the pybind11 API, among others.

I would like to address the comment about making the binary format consistent between Python UDFs and builtin operators in a follow-up PR.

Contributor
@mrshenli mrshenli left a comment

Thanks for adding Python UDF support for RPC!

Approving, but please consider performing the following two actions before merging:

  1. Check with @gqchen to see if there are any additional blocking change requests.
  2. Make sure this works under macOS. I am not sure if the CI tests for macOS here are sufficient; #23228 ran into this problem previously.

Contributor

(This does not need to be fixed in this PR)

IIUC, this is trying to address @ezyang's comments regarding PythonRpcHandler's lifetime. It seems to me this needs to be wrapped by some macro, because we might want to run pure C++ RPC in the future, where this should not depend on Python.

Contributor

(Does not need to be addressed in this PR)

When coming up with file names, I used camel case for class files and snake case for utility files, and then ordered all the files here by ASCII order. This strategy was learned as a mix from the cuda and c10d folders. @pietern @ezyang Should we all stay with snake-case naming for files? Do we have a convention for this?

(Pointed out by @satgera.) This is probably my fault in the first place, and I will submit a fix for it in a separate PR.

Contributor

I'd aim for self-consistency, and the prevailing convention in torch/csrc is snake case.

@pritamdamania87
Contributor

Yep, that's not a problem specific to this PR, and I think it can be fixed in follow-up PRs. It's a good point to add tests for kwargs, and I think this can also be done in follow-up PRs?

Can we create github issues for things to address in followup PRs? This way we wouldn't lose track of these things.

@mrshenli
Contributor

Can we create github issues for things to address in followup PRs? This way we wouldn't lose track of these things.

Yep, see #24252 #24249 and #24247

@zhaojuanmao
Contributor Author

The Windows build failure is not related; all other checks are clean. I built on macOS locally, imported torch, and everything is fine.

Contributor
@facebook-github-bot facebook-github-bot left a comment

@zhaojuanmao is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Summary:
This diff adds support for Python user-defined functions over RPC, for pytorch#23110. The workflow is:
1. pickle the Python UDF
2. pass the pickle to C++
3. C++ passes it over RPC from client to server
4. the server calls the runPythonUDF() Python function to unpickle and run the Python UDF, then pickles the UDF result using the Python embedder
5. the serialized result is passed back from server to client
6. the client calls the loadPythonUDFResult() Python function to unpickle the result
7. the result is returned to Python

For now, rpc_sync_builtin() and rpc_async_builtin() are temporary interfaces for builtin operator remote calls; they accept a qualified name string, and this interface can execute builtin operators in C++ land.

rpc_sync() and rpc_async() accept Python callables only right now; these can be user-defined Python functions or builtin operator Python functions, and they will be executed in Python land.

Once we can resolve builtin operator Python callables to qualified name strings, we can merge rpc_sync_builtin() into rpc_sync().
Pull Request resolved: pytorch#23569

Test Plan: unit tests

Differential Revision: D16390764

fbshipit-source-id: 9373024280d08bc953391464f221e8ab65e9ba10
@zhaojuanmao zhaojuanmao deleted the export-D16390764 branch August 28, 2019 07:55