Closed
Labels
high priority
module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
High level design: #23110
Currently, we clean up the distributed autograd context only on the local node once we're done. We need a framework to send RPCs to the other nodes that participated and clean up the context there as well. This needs to be fault tolerant: a failed cleanup operation should be retried, and we need some way of dealing with cleanup operations that have failed multiple times.
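To make the requirement concrete, here is a minimal sketch of the retry behaviour described above. It is not the proposed implementation. Every name in it (`cleanup_context_on_workers`, `send_cleanup`, `CleanupReport`, the retry count and the backoff delay) is hypothetical, and the RPC itself is abstracted behind a caller-supplied callable so the sketch does not depend on any particular RPC API.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class CleanupReport:
    """Outcome of one cleanup pass (hypothetical structure)."""
    cleaned: List[str] = field(default_factory=list)      # workers that acknowledged cleanup
    failed: Dict[str, str] = field(default_factory=dict)  # worker -> last error, after all retries


def cleanup_context_on_workers(
    context_id: int,
    workers: Iterable[str],
    send_cleanup: Callable[[str, int], None],
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> CleanupReport:
    """Ask each worker to release `context_id`, retrying on failure.

    `send_cleanup(worker, context_id)` is an assumed callable that performs
    the remote call and raises on failure. Workers that still fail after
    `max_attempts` tries are recorded in `report.failed` so the caller can
    decide what to do with them (log, re-queue, give up).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    report = CleanupReport()
    for worker in workers:
        last_error = None
        for attempt in range(max_attempts):
            try:
                send_cleanup(worker, context_id)
                report.cleaned.append(worker)
                break
            except Exception as exc:  # any failure of the remote call counts
                last_error = exc
                if attempt + 1 < max_attempts:
                    # Exponential backoff before the next attempt.
                    sleep(base_delay_s * (2 ** attempt))
        else:
            # The loop finished without a successful send.
            report.failed[worker] = repr(last_error)
    return report
```

Because a retry can reach a node whose earlier attempt actually succeeded but whose reply was lost, the remote cleanup handler in such a scheme would need to be idempotent: releasing an already released context should be a no-op rather than an error. In a real setup, `send_cleanup` might wrap a call such as `torch.distributed.rpc.rpc_sync`, but that wiring is an assumption and is not specified in this issue.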
cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528