Expand PyTorch c10d backend to dynamically load third-party communication libraries #27955
Description
Motivation
Expand the PyTorch c10d backend to allow dynamic loading of non-built-in communication libraries, as a preparatory step toward integrating Intel CCL (aka MLSL) into PyTorch as another c10d backend that supports BFloat16 and future hardware.
Pitch
Improve PyTorch's scaling efficiency for multi-node training.
Additional Context
Extend the mechanism by which c10d selects its built-in communication modules so that it can also dynamically load third-party communication Python modules. The change is small and confined to the c10d Python backend lookup. The user specifies a backend name and passes it to init_process_group() as a parameter in Python code, which invokes the c10d backend lookup. The lookup is extended to import a third-party library matching the passed backend name. The third-party library implements the process_group interface.
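The lookup described above could be sketched roughly as follows. This is not the actual c10d code: `resolve_backend`, `init_process_group_sketch`, the `create_process_group` factory, and the `fake_ccl` module are hypothetical names used only for illustration, and plain dictionaries stand in for real process group objects.

```python
import importlib
import sys
import types

# Names of the backends c10d ships with; anything else is treated as a plug-in.
BUILTIN_BACKENDS = {"gloo", "nccl", "mpi"}


def resolve_backend(name):
    """Return ("builtin", name) for a built-in backend, or ("plugin", module)
    after importing a Python module whose name matches the backend name."""
    key = name.lower()
    if key in BUILTIN_BACKENDS:
        return "builtin", key
    try:
        module = importlib.import_module(name)
    except ImportError as exc:
        raise ValueError(f"unknown backend {name!r}") from exc
    # Assumed contract: the plug-in exposes a factory for its process group.
    if not hasattr(module, "create_process_group"):
        raise ValueError(f"module {name!r} does not provide a process group")
    return "plugin", module


def init_process_group_sketch(backend, rank, world_size):
    """Stand-in for init_process_group(): dispatch on the backend name."""
    kind, target = resolve_backend(backend)
    if kind == "builtin":
        # Placeholder for constructing a built-in process group.
        return {"backend": target, "rank": rank, "world_size": world_size}
    return target.create_process_group(rank, world_size)


# Demo: register an in-memory stand-in for a third-party module, then load it
# by name exactly as a user would pass a backend name.
fake = types.ModuleType("fake_ccl")
fake.create_process_group = lambda rank, world_size: {
    "backend": "fake_ccl",
    "rank": rank,
    "world_size": world_size,
}
sys.modules["fake_ccl"] = fake

pg = init_process_group_sketch("fake_ccl", rank=0, world_size=2)
```

Falling back to an import only when the name is not built in keeps existing backends unaffected, which matches the claim that the change is small.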
Intel CCL is added as a third-party plug-in through a PyTorch C++ extension. CCL threads can be pinned to specific cores through environment variables. Support for bfloat16 all-reduce (bfloat16 gradients reduced to fp32) is on the roadmap.
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528