RRef leak for other RPC Agents #31325

`ProcessGroupAgent` matches message counters to ensure that all in-flight messages are finished before the agent shuts down. Other `RpcAgent` implementations have no similar mechanism. So we want to eliminate `RpcAgent::join()` by adding a common utility that does an RPC-based barrier + shutdown, in https://github.com/pytorch/pytorch/pull/30710.
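To illustrate what counter matching means here, below is a toy sketch in plain Python. It is not the actual `ProcessGroupAgent` code. `ToyAgent`, `deliver`, and `all_messages_delivered` are hypothetical names made up for this example. The idea is only that shutdown is safe once every sender's sent count equals the matching receiver's received count.

```python
from collections import Counter


class ToyAgent:
    """Hypothetical stand-in for an RPC agent; tracks per-peer message counts."""

    def __init__(self, name):
        self.name = name
        self.sent = Counter()      # destination name -> number of messages sent
        self.received = Counter()  # source name -> number of messages received

    def send(self, dst):
        # Count the message as sent; it stays "in flight" until deliver() runs.
        self.sent[dst.name] += 1
        return (self, dst)


def deliver(message):
    # Simulate the message arriving at its destination.
    src, dst = message
    dst.received[src.name] += 1


def all_messages_delivered(agents):
    # No message is in flight iff every (sender, receiver) pair has matching counts.
    return all(
        a.sent[b.name] == b.received[a.name]
        for a in agents
        for b in agents
    )
```

In this toy model, an agent would only be allowed to shut down once `all_messages_delivered` returns `True` for the whole group.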

Even with https://github.com/pytorch/pytorch/pull/30710, trying to remove `agent.join()` uncovers that RRefs can leak, because the user workers might not have sent out their delete fork messages before they shut down RPC locally.

Currently, sending out RRef fork delete messages depends solely on Python GC timing. We need to add an API in the `RRefContext` to proactively send out the delete fork messages before shutting down the local RPC layer. Only with this feature can we avoid sporadic RRef leaks for other RPC agents.
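A rough sketch of the proposed behavior, as a toy model in plain Python rather than the real `RRefContext`. `ToyRRefContext`, `ToyUserRRef`, `delete_all_forks`, and `send_delete` are all hypothetical names. It assumes the context can track live user forks through weak references, so that it can flush their delete messages without waiting for GC.

```python
import weakref


class ToyRRefContext:
    """Hypothetical context that tracks live user forks (not the real RRefContext)."""

    def __init__(self):
        self.sent_deletes = []  # fork ids whose delete message was "sent"
        # Weak references, so the context itself never keeps a fork alive.
        self._live_forks = weakref.WeakValueDictionary()

    def register(self, fork):
        self._live_forks[fork.fork_id] = fork

    def send_delete(self, fork_id):
        # Stand-in for sending the delete fork message to the owner.
        self.sent_deletes.append(fork_id)

    def delete_all_forks(self):
        # Proactively flush deletes for every live fork, independent of GC timing.
        for fork in list(self._live_forks.values()):
            fork.delete()


class ToyUserRRef:
    """Hypothetical user-side fork of an RRef."""

    def __init__(self, context, fork_id):
        self._context = context
        self.fork_id = fork_id
        self._deleted = False
        context.register(self)

    def delete(self):
        # Idempotent: the delete message is sent at most once per fork.
        if not self._deleted:
            self._deleted = True
            self._context.send_delete(self.fork_id)

    def __del__(self):
        # Today's behavior in this model: the delete is sent only when GC runs.
        self.delete()
```

In this model, local shutdown would call `delete_all_forks()` first and only then tear down the RPC layer, so no fork is left waiting on GC.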

Currently, other RPC agents hide this RRef leak by sleeping for 2 seconds during shutdown, so that Python GC has time to trigger the delete messages.

This leak also surfaced in https://github.com/pytorch/pytorch/issues/30022. When closing this issue, make sure the case mentioned in that issue is also covered.

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528
