🚀 The feature, motivation and pitch
I am trying to train a sparse transformer using FSDP, where the feedforward network (FFN) in each layer is replaced with a Mixture-of-Experts (MoE) layer implemented by fairscale.
In this setup, experts and non-experts belong to different data-parallel groups. If I simply pass the global process_group [0, 1, 2, 3] to the FSDP constructor, it leads to abnormal backward behavior for the experts. Therefore, I would like to know whether it is possible to set a different process_group for the expert and non-expert parameters.
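To make the request concrete, here is a sketch (not part of the original report) of the rank layout I have in mind: dense (non-expert) parameters would be sharded across the full world, while each expert's parameters would be sharded only within a slice of ranks. The helper below just computes the rank lists; the `dist.new_group` / FSDP calls that would consume them require an initialized distributed environment and are shown only as comments, with hypothetical module names.

```python
from typing import Dict, List


def build_dp_rank_lists(world_size: int, num_expert_groups: int) -> Dict[str, List[List[int]]]:
    """Partition ranks into data-parallel groups.

    Dense (non-expert) parameters are sharded across all ranks, so
    their group is the full world. Expert parameters are sharded only
    within a contiguous slice of ranks, so each expert group covers
    world_size // num_expert_groups ranks.
    """
    assert world_size % num_expert_groups == 0
    per_group = world_size // num_expert_groups
    dense = [list(range(world_size))]
    expert = [
        list(range(g * per_group, (g + 1) * per_group))
        for g in range(num_expert_groups)
    ]
    return {"dense": dense, "expert": expert}


# With an initialized process group, the rank lists above would be
# turned into communicator groups and passed to separate FSDP wrappers
# (module names here are hypothetical):
#
#   import torch.distributed as dist
#   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
#
#   groups = build_dp_rank_lists(dist.get_world_size(), num_expert_groups=2)
#   expert_pgs = [dist.new_group(ranks) for ranks in groups["expert"]]
#   layer.moe = FSDP(layer.moe, process_group=expert_pgs[my_group_idx])
#   model = FSDP(model)  # default: global process group

if __name__ == "__main__":
    groups = build_dp_rank_lists(world_size=4, num_expert_groups=2)
    print(groups["dense"])   # [[0, 1, 2, 3]]
    print(groups["expert"])  # [[0, 1], [2, 3]]
```

Whether FSDP itself can honor such per-submodule groups in the backward pass is exactly what this feature request is asking about.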


Alternatives
No response
Additional context
No response
cc @zhaojuanmao @mrshenli @rohan-varma @awgu @fegin @penguinwu @kwen2501