RRef leak for other RPC Agents #31325

@xush6528

ProcessGroupAgent uses message-counter matching to ensure that all in-flight messages have finished before the agent shuts down. Other RpcAgent implementations have no equivalent mechanism. We therefore want to eliminate RpcAgent::join() by adding a common utility that performs an RPC-based barrier + shutdown, in #30710.
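The counter-matching idea can be illustrated with a toy model. This is only a sketch under assumptions, not the ProcessGroupAgent code: `ToyAgent`, its `inbox`, `drain`, and `all_messages_delivered` are hypothetical names, and delivery is simulated with an in-memory queue. Each agent counts what it sent per destination and what it received per source; shutdown is considered safe only when every sent count matches the corresponding received count.

```python
from collections import Counter, deque


class ToyAgent:
    """Hypothetical in-memory agent; stands in for a real RPC agent."""

    def __init__(self, name):
        self.name = name
        self.sent = Counter()      # destination name -> messages sent
        self.received = Counter()  # source name -> messages processed
        self.inbox = deque()       # messages delivered but not yet processed

    def send(self, dst):
        # Count the message as sent; it stays in flight until dst drains it.
        self.sent[dst.name] += 1
        dst.inbox.append(self.name)

    def drain(self):
        # Process every queued message and count it as received.
        while self.inbox:
            src = self.inbox.popleft()
            self.received[src] += 1


def all_messages_delivered(agents):
    """True when every sender's count matches the receiver's count."""
    return all(
        src.sent[dst.name] == dst.received[src.name]
        for src in agents
        for dst in agents
    )


a, b = ToyAgent("a"), ToyAgent("b")
a.send(b)
before_drain = all_messages_delivered([a, b])  # message still in flight
b.drain()
after_drain = all_messages_delivered([a, b])   # counters now match
```

In a real multi-process setting the counters would have to be exchanged between workers (for example through a collective), which this single-process sketch leaves out.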

Even with #30710, trying to remove agent.join() reveals that RRefs can leak, because user workers might not have sent out their delete fork messages before they shut down RPC locally.

Currently, sending out RRef delete fork messages depends solely on Python GC timing. We need to add an API to RRefContext that proactively sends out delete fork messages before the RPC layer is shut down locally. Only with this feature can the sporadic RRef leak be eliminated for other RPC agents.
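A minimal sketch of what such a proactive flush could look like, assuming the context tracks the forks a worker still holds. Everything here is hypothetical and for illustration only: `ToyRRefContext`, `add_fork`, `del_fork`, `flush_pending_deletes`, and the `(owner, rref_id, fork_id)` message shape are not the real RRefContext API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical delete message: (owner_worker, rref_id, fork_id)
DeleteMsg = Tuple[str, int, int]


@dataclass
class ToyRRefContext:
    """Toy stand-in for a user-side RRef context; not the real RRefContext."""

    send: Callable[[DeleteMsg], None]
    # Forks this worker still holds: fork_id -> (owner, rref_id)
    pending_forks: Dict[int, Tuple[str, int]] = field(default_factory=dict)

    def add_fork(self, fork_id: int, owner: str, rref_id: int) -> None:
        self.pending_forks[fork_id] = (owner, rref_id)

    def del_fork(self, fork_id: int) -> None:
        """Normal path: invoked when GC collects the user RRef."""
        entry = self.pending_forks.pop(fork_id, None)
        if entry is not None:
            owner, rref_id = entry
            self.send((owner, rref_id, fork_id))

    def flush_pending_deletes(self) -> int:
        """Proactive path: tell owners about every fork still held."""
        fork_ids = list(self.pending_forks)
        for fork_id in fork_ids:
            self.del_fork(fork_id)
        return len(fork_ids)


sent: List[DeleteMsg] = []
ctx = ToyRRefContext(send=sent.append)
ctx.add_fork(fork_id=1, owner="worker0", rref_id=100)
ctx.add_fork(fork_id=2, owner="worker1", rref_id=200)
ctx.del_fork(1)                        # GC happened to run for this fork
flushed = ctx.flush_pending_deletes()  # shutdown hook covers the rest
```

Calling the flush before the local RPC layer shuts down makes the delete messages independent of GC timing. A real implementation would also need to wait for the owners to acknowledge the messages, which this sketch does not model.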

Currently, other RPC agents hide this RRef leak by sleeping for 2 seconds during shutdown, so that Python GC has a chance to trigger the messages.

This leak was also surfaced in #30022. When closing this issue, make sure the case mentioned in that issue is also covered.

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528

Labels

module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)
triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)
