DistAutogradContext should be cleaned up on other nodes. #25525

@pritamdamania87

High level design: #23110

Currently, we clean up the distributed autograd context only on the local node once we're done. We need a framework that issues RPCs to the other nodes so the context is cleaned up there as well. This needs to be fault tolerant: the cleanup operation should be retried on failure, and we need some way of dealing with cleanup operations that have failed multiple times.
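As a rough illustration of the retry behavior described above, here is a minimal sketch in plain Python. It is not the PyTorch implementation: `cleanup_context_on_peers`, `send_cleanup`, and the backoff parameters are hypothetical names chosen for this example. `send_cleanup` stands in for whatever RPC would ask a peer to release a context, and the sketch assumes that call is idempotent on the receiving side so that retries are safe.

```python
import time
from typing import Callable, Iterable, List


def cleanup_context_on_peers(
    context_id: int,
    peers: Iterable[str],
    send_cleanup: Callable[[str, int], None],  # hypothetical RPC wrapper
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Ask each peer to release `context_id`.

    Returns the peers whose cleanup failed on every attempt, so the
    caller can decide how to handle them (log, re-queue, alert, ...).
    """
    failed: List[str] = []
    for peer in peers:
        for attempt in range(max_attempts):
            try:
                send_cleanup(peer, context_id)
                break  # this peer is done
            except Exception:
                # Back off exponentially before the next attempt,
                # but do not sleep after the final one.
                if attempt + 1 < max_attempts:
                    sleep(base_delay_s * (2 ** attempt))
        else:
            # The loop never hit `break`: every attempt failed.
            failed.append(peer)
    return failed
```

The returned list is one possible answer to the "failed multiple times" case: the caller could hand those peers to a background sweeper or rely on a timeout on the remote side. Which of these fits is a design decision this sketch leaves open.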

cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528

Metadata

Labels

high priority

module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)

triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
