[RFC] RPC timeout #29402

@xush6528

Issue

We currently have an argument called rpc_timeout in the init_model_parallel function.
The name suggests a client-perspective, round-trip RPC timeout.

However, looking at its implementation, it is actually the server-side processing timeout. The client side takes no action at the moment.
That means that if the server fails to send the response, the client-side future waits forever.

API

I suggest that we provide a timeout arg on the two RPC entry points:

  • rpc.rpc_async(func, args, kwargs, to, timeout: Optional[timedelta] = None) -> FutureMessage
  • rpc.rpc_sync(func, args, kwargs, to, timeout: Optional[timedelta] = None) -> Any

If timeout is not provided (passed as None), the server side uses the global timeout set in init_model_parallel.

If timeout is provided, it means a per-RPC timeout is specified.

  • For rpc.rpc_async, the future_message returned to the client should be cancelled automatically on timeout, and if the user calls future_message.wait(), it should raise a TimeoutError. For rpc.rpc_sync, the call should block until the timeout and then raise a TimeoutError.
  • The Message passed from client to server should also contain this per-RPC timeout. On the server side, the timeout contained in the message should be treated as per_rpc_server_processing_timeout, thus overriding the server-side global processing timeout.
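
To make the server-side override concrete, here is a minimal sketch. The `Message` dataclass and the `effective_server_timeout` helper are hypothetical names made up for illustration; they are not part of the existing RPC code, and the real message format may differ.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class Message:
    """Hypothetical wire message; only the timeout field matters here."""
    payload: bytes
    # Per-RPC timeout set by the caller, or None if the caller gave none.
    timeout: Optional[timedelta] = None


def effective_server_timeout(msg: Message, global_timeout: timedelta) -> timedelta:
    """Pick the processing timeout the server applies to this message.

    A per-RPC timeout carried in the message overrides the global
    server-side processing timeout; otherwise the global value is used.
    """
    if msg.timeout is not None:
        return msg.timeout
    return global_timeout
```

With this shape, a caller that passes timeout=None leaves the message field empty, and the server falls back to the global value set at init time.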

Since the rpc_timeout passed to init_model_parallel could be overridden by a per-RPC timeout, it should be renamed to global_rpc_server_processing_timeout.

Implied by the above, for supporting client-side timeout, the generic Future (#28923) needs to support timeout. A reference implementation is folly::Future::within(..).
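
As a rough illustration of what a within-style timeout on a future could look like, the sketch below uses Python's standard concurrent.futures.Future rather than the RPC layer's own future type. The `within` helper is a hypothetical name borrowed from the folly reference; it is not an existing API here.

```python
import threading
from concurrent.futures import CancelledError, Future


def within(source: Future, seconds: float) -> Future:
    """Return a future that mirrors `source` but fails with TimeoutError
    if `source` has not completed after `seconds`."""
    result: Future = Future()
    lock = threading.Lock()

    def settle(apply) -> None:
        # Only the first of (source completion, timer expiry) settles the result.
        with lock:
            if not result.done():
                apply()

    def on_timeout() -> None:
        settle(lambda: result.set_exception(
            TimeoutError(f"no response within {seconds}s")))

    timer = threading.Timer(seconds, on_timeout)
    timer.daemon = True

    def on_source_done(fut: Future) -> None:
        timer.cancel()  # the response arrived, so the timer is no longer needed
        if fut.cancelled():
            settle(lambda: result.set_exception(CancelledError()))
            return
        exc = fut.exception()
        if exc is not None:
            settle(lambda: result.set_exception(exc))
        else:
            settle(lambda: result.set_result(fut.result()))

    timer.start()
    source.add_done_callback(on_source_done)
    return result
```

Under this sketch, rpc_async with a timeout could return within(future, timeout.total_seconds()), and rpc_sync could simply wait on that wrapped future, so both entry points would raise the same TimeoutError.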

Policy

For user RPCs, we always fill in the message's timeout with the default RPC timeout.
For system RPCs, it defaults to 0 (which means infinite) unless the system RPC sets it.
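
The policy can be sketched as a small helper. The names `fill_timeout` and `DEFAULT_RPC_TIMEOUT`, and the 60-second value, are assumptions for illustration only. The sketch also assumes that an explicit per-call timeout takes precedence for both kinds of RPC, as the API section above suggests.

```python
from datetime import timedelta
from typing import Optional

# Assumed default for illustration; the real default is whatever is set at init time.
DEFAULT_RPC_TIMEOUT = timedelta(seconds=60)
# A timeout of 0 is read as "no timeout" (wait indefinitely).
INFINITE = timedelta(0)


def fill_timeout(requested: Optional[timedelta], is_system_rpc: bool) -> timedelta:
    """Choose the timeout to place in the outgoing message."""
    if requested is not None:
        # The caller (user code or a system RPC) set a timeout explicitly.
        return requested
    # No explicit value: system RPCs wait indefinitely, user RPCs get the default.
    return INFINITE if is_system_rpc else DEFAULT_RPC_TIMEOUT


def is_infinite(timeout: timedelta) -> bool:
    """True when the timeout value means 'never time out'."""
    return timeout == INFINITE
```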

See #29018

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528


Labels: module: rpc, triaged
