-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Description
Issue
Now we have an arumgnet called rpc_timeout in the def init_model_parallel function.
The name feels like it's a client-perspective, round-trip RPC timeout.
While, looking at it's implementation, it's actually the server-side processing timeout. Client side takes no action at the moment.
That means, if the server-side failed to send the response, the user side future would wait forever.
API
I suggest that we provide an rpc_timeout arg on the 2 RPC entries,
rpc.rpc_async(func, args, kwargs, to, timeout: Optional[timedelta] = None) -> Anyrpc.rpc_sync(func, args, kwargs, to, Optional[timeout] = None) -> FutureMessage.
If timeout is not provided (passed as None), server side uses the global timeout set in def init_model_parallel.
If timeout is provided, it means a per-RPC timeout is specified.
- For
rpc.rpc_async, thefuture_messagereturned to the client should automatically cancle on timeout, and if users callfuture_message.wait(), it should raises aTimeoutError. Forrpc.rpc_sync, it should block untill timeout and raise aTimeoutError. - The
Messagebeing passed from client to server should also contains this per-RPC timeout. On the server side, the timeout caontained in the message should be treated asper_rpc_server_proessing_timeout, thus overwriting the server-side global processing timeout.
Since the rpc_timeout passed to def init_model_parallel could be overwritten by per-RPC call, it should be renamed to global_rpc_server_processing_timeout.
Implied by the above, for supporting client-side timeout, the generic Future (#28923) needs to support timeout. A reference implementation is folly::Future::within(..).
Policy
For user RPCs, we always fill that in with the default rpc timeout.
For system RPCs, it'll default to 0 (which would be infinite) unless the system RPC sets it.
See #29018
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528