Description
ProcessGroupAgent matches message counters to ensure that all in-flight messages have finished before the agent shuts down. The other RpcAgent implementations have no equivalent mechanism. We therefore want to eliminate RpcAgent::join() by adding a common utility that performs an RPC-based barrier plus shutdown, see #30710.
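To illustrate the idea of message counter matching, here is a minimal sketch. It is not the ProcessGroupAgent code: the function name `all_messages_delivered` and the matrix layout are assumptions made for this example. It assumes every worker tracks how many messages it sent to and received from each peer, and that these counts have been gathered on all workers.

```python
def all_messages_delivered(sent, recv):
    """Return True when no message is still in flight.

    Hypothetical layout (an assumption for this sketch):
      sent[i][j] = number of messages worker i sent to worker j
      recv[j][i] = number of messages worker j received from worker i
    Every send must be matched by a receive on the destination.
    """
    n = len(sent)
    return all(
        sent[i][j] == recv[j][i]
        for i in range(n)
        for j in range(n)
    )


# Worker 0 sent 2 messages to worker 1; worker 1 sent 1 message to worker 0.
sent_counts = [[0, 2], [1, 0]]
# Both workers have received everything that was sent to them.
recv_counts = [[0, 1], [2, 0]]
print(all_messages_delivered(sent_counts, recv_counts))
```

An agent built this way would keep re-checking the gathered counts and only shut down once the check passes.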
Even with #30710, trying to remove agent.join() reveals that RRefs can leak, because user workers might not have sent their delete-fork messages before they shut down RPC locally.
Currently, sending RRef delete-fork messages depends solely on Python GC timing. We need to add an API to RRefContext that proactively sends the delete-fork messages before the local RPC layer shuts down. Only with this feature will the other RPC agents be free of sporadic RRef leaks.
At present, the other RPC agents hide this RRef leak by sleeping for 2 seconds during shutdown, so that Python GC has time to trigger the messages.
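The proposed behavior can be sketched with a toy model. This is not the RRefContext API: `ToyRRefContext`, `add_fork`, `on_gc`, `flush_pending_deletes`, and the `("DELETE_FORK", fork_id)` message shape are all hypothetical names chosen for illustration. The sketch only shows the difference between waiting for GC to trigger each delete message and flushing every pending one explicitly before shutdown.

```python
class ToyRRefContext:
    """Toy stand-in for a context that tracks RRef forks held locally."""

    def __init__(self, send):
        # send(owner, message) is a hypothetical transport callback.
        self._send = send
        # fork_id -> owner worker name, for forks whose delete message
        # has not been sent yet.
        self._pending_forks = {}

    def add_fork(self, fork_id, owner):
        self._pending_forks[fork_id] = owner

    def on_gc(self, fork_id):
        # GC-driven path: runs only if and when the fork gets collected.
        owner = self._pending_forks.pop(fork_id, None)
        if owner is not None:
            self._send(owner, ("DELETE_FORK", fork_id))

    def flush_pending_deletes(self):
        # Proactive path: send a delete message for every fork that GC
        # has not handled yet, then forget them so nothing is sent twice.
        for fork_id, owner in list(self._pending_forks.items()):
            self._send(owner, ("DELETE_FORK", fork_id))
        self._pending_forks.clear()


outbox = []
ctx = ToyRRefContext(lambda owner, msg: outbox.append((owner, msg)))
ctx.add_fork(1, "worker0")
ctx.add_fork(2, "worker1")
ctx.on_gc(1)                 # GC happened to collect fork 1 in time
ctx.flush_pending_deletes()  # fork 2 is handled explicitly before shutdown
print(outbox)
```

With an explicit flush before the RPC layer shuts down, the shutdown path would no longer need a sleep to give GC time to run.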
This problem was also surfaced in #30022. When closing this issue, make sure the case mentioned there is covered as well.
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528