🚀 The feature, motivation and pitch
I am trying to train a sparse transformer using FSDP, where the feedforward network (FFN) in each layer is replaced with a Mixture-of-Experts (MoE) layer implemented by fairscale.
In this setup, experts and non-experts belong to different data-parallel groups. If I simply pass the global process_group [0, 1, 2, 3] to the FSDP constructor, it leads to abnormal backward behavior for the experts. Therefore, I would like to know whether it is possible to set a different process_group for the expert and non-expert parameters.
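To make the request concrete, here is a sketch (not part of the original report) of the rank layout I have in mind: dense (non-expert) parameters would be sharded across the full world, while each expert's parameters would be sharded only within a slice of ranks. The helper below just computes the rank lists; the `dist.new_group` / FSDP calls that would consume them require an initialized distributed environment and are shown only as comments, with hypothetical module names.

```python
from typing import Dict, List


def build_dp_rank_lists(world_size: int, num_expert_groups: int) -> Dict[str, List[List[int]]]:
    """Partition ranks into data-parallel groups.

    Dense (non-expert) parameters are sharded across all ranks, so
    their group is the full world. Expert parameters are sharded only
    within a contiguous slice of ranks, so each expert group covers
    world_size // num_expert_groups ranks.
    """
    assert world_size % num_expert_groups == 0
    per_group = world_size // num_expert_groups
    dense = [list(range(world_size))]
    expert = [
        list(range(g * per_group, (g + 1) * per_group))
        for g in range(num_expert_groups)
    ]
    return {"dense": dense, "expert": expert}


# With an initialized process group, the rank lists above would be
# turned into communicator groups and passed to separate FSDP wrappers
# (module names here are hypothetical):
#
#   import torch.distributed as dist
#   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
#
#   groups = build_dp_rank_lists(dist.get_world_size(), num_expert_groups=2)
#   expert_pgs = [dist.new_group(ranks) for ranks in groups["expert"]]
#   layer.moe = FSDP(layer.moe, process_group=expert_pgs[my_group_idx])
#   model = FSDP(model)  # default: global process group

if __name__ == "__main__":
    groups = build_dp_rank_lists(world_size=4, num_expert_groups=2)
    print(groups["dense"])   # [[0, 1, 2, 3]]
    print(groups["expert"])  # [[0, 1], [2, 3]]
```

Whether FSDP itself can honor such per-submodule groups in the backward pass is exactly what this feature request is asking about.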


Alternatives
No response
Additional context
No response
cc @zhaojuanmao @mrshenli @rohan-varma @awgu @fegin @penguinwu @kwen2501