Custom Process Group for Each Module in FSDP #114361

@liuslnlp

Description

🚀 The feature, motivation and pitch

I am trying to train a sparse transformer with FSDP, where the feed-forward network (FFN) in each layer is replaced with a Mixture-of-Experts (MoE) layer implemented by fairscale.

As shown below, experts and non-experts have different data-parallel groups. If I simply pass the process group [0, 1, 2, 3] to the FSDP constructor, the backward pass behaves abnormally for the experts. Is it possible to set a different `process_group` for the expert and non-expert modules?

image
image
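To make the request concrete, here is a rough, untested sketch of what per-module process groups could look like with manual nested wrapping. Everything about the layout is an assumption, not something taken from the images above: rank `r` is assumed to hold expert shard `r % expert_parallel_size`, and `expert_data_parallel_groups`, `wrap_with_expert_groups`, and `expert_cls` are hypothetical names. Only `dist.new_group` and the `process_group` argument of the FSDP constructor are existing PyTorch APIs. Whether backward then behaves correctly with the fairscale MoE layer is not verified here.

```python
from typing import List


def expert_data_parallel_groups(world_size: int, expert_parallel_size: int) -> List[List[int]]:
    """Return the rank lists that replicate the same expert shard.

    Assumption (hypothetical layout): rank r holds expert shard
    r % expert_parallel_size, so ranks with the same remainder
    form one expert data-parallel group.
    """
    if expert_parallel_size <= 0 or world_size % expert_parallel_size != 0:
        raise ValueError("world_size must be a positive multiple of expert_parallel_size")
    return [
        list(range(shard, world_size, expert_parallel_size))
        for shard in range(expert_parallel_size)
    ]


def wrap_with_expert_groups(model, expert_cls, expert_parallel_size: int):
    """Wrap expert submodules with their own group, everything else with the default group.

    Requires torch.distributed to be initialized already. torch is imported
    lazily so the rank helper above can be used without it.
    """
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    expert_group = None
    for ranks in expert_data_parallel_groups(world_size, expert_parallel_size):
        # Every rank has to call new_group for every group, member or not.
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            expert_group = group

    # Collect targets first so modules are not replaced while iterating.
    targets = [
        name for name, module in model.named_modules()
        if name and isinstance(module, expert_cls)
    ]
    for name in targets:
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        child = getattr(parent, child_name)
        setattr(parent, child_name, FSDP(child, process_group=expert_group))

    # The outer wrapper uses the default (world) group for non-expert parameters.
    return FSDP(model)
```

Under this assumed layout, 4 ranks with `expert_parallel_size=2` would give the expert groups `[0, 2]` and `[1, 3]`, while non-expert parameters stay sharded over `[0, 1, 2, 3]`.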

Alternatives

No response

Additional context

No response

cc @zhaojuanmao @mrshenli @rohan-varma @awgu @fegin @penguinwu @kwen2501

Metadata

    Labels

    module: fsdp, triaged
