DistAutogradContext should be cleaned up on other nodes.

High level design: https://github.com/pytorch/pytorch/issues/23110

Currently, we clean up the distributed autograd context only on the local node once we're done. We need a framework to call RPCs on other nodes and clean up the context there as well. This needs to be fault tolerant: we should retry the cleanup operation if it fails, and we need some way of dealing with cleanup operations that have failed multiple times.
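
To make the retry requirement concrete, here is a minimal sketch in plain Python. It is not the PyTorch implementation or a proposed API: `cleanup_context_on_peers`, `send_cleanup`, and `max_attempts` are hypothetical names, and `send_cleanup` stands in for whatever RPC ends up asking a peer to release its context. The sketch assumes that the remote cleanup is idempotent, since a retry may reach a peer that already released the context.

```python
from typing import Callable, Dict, Iterable, List


def cleanup_context_on_peers(
    context_id: int,
    peers: Iterable[str],
    send_cleanup: Callable[[str, int], None],  # hypothetical RPC stand-in
    max_attempts: int = 3,
) -> Dict[str, List[str]]:
    """Ask every peer to release `context_id`, retrying failed requests.

    Returns the peers that confirmed cleanup and the peers for which
    every attempt failed, so the caller can decide how to handle them.
    """
    cleaned: List[str] = []
    gave_up: List[str] = []
    for peer in peers:
        for attempt in range(1, max_attempts + 1):
            try:
                send_cleanup(peer, context_id)
            except Exception:
                # Keep retrying until the attempt budget is exhausted.
                if attempt == max_attempts:
                    gave_up.append(peer)
            else:
                cleaned.append(peer)
                break
    return {"cleaned": cleaned, "gave_up": gave_up}
```

The `gave_up` list is one possible answer to "cleanup operations that failed multiple times": the caller could log those peers, hand them to a background sweeper, or rely on a timeout on the remote side. A real version would likely also add backoff between attempts and send the requests concurrently.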

cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528


DistAutogradContext should be cleaned up on other nodes. #25525
