Expand PyTorch C10D backend to dynamically load third-party communication libraries #27955

@Jianhui-Li

Motivation

Expand the PyTorch C10D backend to allow dynamically loading non-built-in communication libraries, as a preparatory step toward integrating Intel CCL (aka MLSL) into PyTorch as another c10d backend that supports BFloat16 and future hardware.

Pitch

Enrich PyTorch for better scaling efficiency in multi-node training.

Additional Context

Extend PyTorch's c10d built-in communication module mechanism to support dynamically loading third-party communication Python modules. The change is very small and is confined to the c10d Python query mechanism. The user specifies a backend name and passes it to init_process_group() as a parameter in Python code, which in turn calls the c10d Python query mechanism. The query mechanism is extended to import a third-party library according to the passed backend name. The third-party library implements the process_group interface.
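The lookup described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the actual c10d code: `BUILTIN_BACKENDS`, `resolve_backend`, `create_process_group`, and the `fake_ccl` module are hypothetical names, and process groups are simplified to dictionaries so the sketch runs without PyTorch.

```python
import importlib
import sys
import types

# Hypothetical stand-in for the built-in backends. Real c10d process groups
# are C++ objects; plain dictionaries are used here for illustration only.
BUILTIN_BACKENDS = {
    "gloo": lambda rank, world_size: {
        "backend": "gloo", "rank": rank, "world_size": world_size
    },
}


def resolve_backend(name):
    """Return a process-group factory for the given backend name."""
    key = name.lower()
    if key in BUILTIN_BACKENDS:
        return BUILTIN_BACKENDS[key]
    # Not built in: treat the name as an importable Python module.
    try:
        module = importlib.import_module(name)
    except ImportError as exc:
        raise ValueError(f"unknown backend: {name!r}") from exc
    # Assumed convention: the plug-in exposes create_process_group().
    factory = getattr(module, "create_process_group", None)
    if not callable(factory):
        raise ValueError(f"module {name!r} lacks create_process_group()")
    return factory


# Simulate an installed third-party plug-in by registering a fake module.
fake_ccl = types.ModuleType("fake_ccl")
fake_ccl.create_process_group = lambda rank, world_size: {
    "backend": "fake_ccl", "rank": rank, "world_size": world_size
}
sys.modules["fake_ccl"] = fake_ccl

pg = resolve_backend("fake_ccl")(rank=0, world_size=2)
print(pg)
```

Built-in names are checked first, so a third-party module cannot shadow an existing backend in this sketch; whether the real mechanism should behave that way is a design choice left open here.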

Intel CCL is added as a third-party plug-in through a PyTorch C++ extension. CCL threads can be pinned to specific cores through environment variables. Support for bfloat16 allreduce (reducing bfloat16 gradients to fp32) is on the roadmap.

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528

Labels: feature, oncall: distributed, triaged
