
Make DDP failure recoverable #21344

@mrshenli

Description


@kuttas, @pietern, and I discussed how to make DDP failure recoverable. The recovery process would involve the following steps:

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

m = SomeModel()
dist.init_process_group(backend="gloo")  # or any supported backend
ddp = DistributedDataParallel(m)
# ... a failure occurs during training ...
dist.destroy_process_group()
dist.init_process_group(backend="gloo")  # re-initialize the process group
del ddp
ddp = DistributedDataParallel(m)         # re-wrap the same original module

This does not work in today's DDP. Currently, to get better performance, DDP assigns the original module as the first module replica instead of creating a new one. It then creates a new Reducer that adds post hooks to sync params. However, because every reconstructed DDP instance wraps the same original module, all of their reducers add hooks to the same set of variables. Hence, after 10 recoveries, each param (variable) in the original module carries 11 hooks installed by 11 different reducers, of which only the last one is still alive. We thought about several potential solutions; they are listed below, after a toy illustration of the problem.
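For illustration only, here is a toy sketch of the accumulation problem. It is not DDP's actual mechanism: it uses Tensor.register_hook instead of the Reducer's C++ post hooks, and the inner loop stands in for "reconstructing DDP around the same module".

import torch
import torch.nn as nn

m = nn.Linear(4, 4)

# Re-"wrapping" the same module three times, each time registering a hook on
# every parameter, mimics what a reconstructed Reducer does today.
for _ in range(3):
    for p in m.parameters():
        p.register_hook(lambda grad: grad)  # stands in for the Reducer's sync hook

# After three rounds, each parameter carries three hooks, and all of them
# fire on backward, even though only the last "wrapper" is still alive.
m(torch.randn(2, 4)).sum().backward()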

Solution 1: Force Module Replication

Force DDP to create new module replicas instead of reusing the original module. That way, the variables in the replicas (and the hooks attached to them) die together with the DDP instance. But it would make DDP slower. Maybe make it an option? A rough user-level equivalent is sketched below.
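Continuing from the snippet at the top of the issue, a user-level approximation of this idea could look like the following. The use of copy.deepcopy is my assumption for illustration; DDP's internal replication path would do this in C++ and broadcast state to keep replicas in sync.

import copy

# Wrap a fresh copy of the model so that the Reducer's hooks attach to the
# copy's parameters; when the DDP instance (and the copy) is dropped, the
# hooks go away with it.
replica = copy.deepcopy(m)
ddp = DistributedDataParallel(replica)

# Caveat: updates now land on `replica`'s parameters, so the original `m`
# no longer reflects training progress unless its state is copied back.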

Solution 2: Delete Hooks in Destructor

I feel the best approach would be to delete those hooks from the model's variables when the Reducer is destructed, but I didn't find a clean way to do that. The add_post_hook function takes unique pointers, and we can get the hooks back through post_hooks; directly looping through the hooks vector and finding the target to delete seems too hackish.
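The hooks in question live in C++, but a Python-level analogue of what a Reducer destructor would need to do looks like this. The HookedWrapper class is purely hypothetical; the point is that Tensor.register_hook returns a RemovableHandle, which is exactly the removal semantics Solution 2 would need on the C++ side.

import torch.nn as nn

class HookedWrapper:
    """Hypothetical wrapper: keep handles to every hook we register so that
    tear-down can remove them instead of leaving dead hooks behind."""

    def __init__(self, module: nn.Module):
        self.module = module
        self._handles = [
            p.register_hook(lambda grad: grad)  # placeholder for a sync hook
            for p in module.parameters()
            if p.requires_grad
        ]

    def remove_hooks(self):
        for h in self._handles:
            h.remove()          # RemovableHandle cleanly detaches the hook
        self._handles.clear()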

Solution 3: Create New Variables (?)

Not sure if this can work. Instead of creating replicas (as in Solution 1), let DDP create a new variable for every parameter in the original module. All DDP forward and backward passes would use those new variables. I think this won't work if the application wraps only part of the model in DDP, because there would be two disjoint autograd graphs (?)
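A hypothetical sketch of what "a new variable for every parameter" could mean, reusing m from the snippet above. The detach-and-share-storage approach is my assumption, not a decided design; it also makes the disjoint-graph concern concrete.

# Create new leaf variables that share storage with the original parameters.
# The Reducer's hooks would attach to these leaves and die with the DDP
# instance, but autograd history recorded through the original parameters is
# invisible to them, hence the disjoint-graph concern when only part of the
# model is wrapped in DDP.
new_params = {
    name: p.detach().requires_grad_(True)
    for name, p in m.named_parameters()
}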

@soumith @gchanan @ezyang thoughts?
