Description
🐛 Describe the bug
Currently we do both. We have a submodule:
https://github.com/pytorch/pytorch/tree/main/third_party/nccl
And we use PyPI NCCL binaries:
https://github.com/pytorch/pytorch/blob/main/.github/scripts/generate_binary_build_matrix.py#L62
We also have code that checks whether the submodule version is consistent with the PyPI version:
https://github.com/pytorch/pytorch/blob/main/.github/scripts/generate_binary_build_matrix.py#L434
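For readers unfamiliar with this kind of check, here is a minimal sketch of how a submodule version could be compared against a PyPI pin. It is not the code in `generate_binary_build_matrix.py`; the function names are hypothetical, and it assumes the submodule exposes its version as `NCCL_MAJOR`, `NCCL_MINOR`, and `NCCL_PATCH` assignments in a makefile-style text.

```python
import re


def parse_version_mk(text: str) -> str:
    """Build "MAJOR.MINOR.PATCH" from makefile-style NCCL_* assignments (assumed layout)."""
    keys = ("NCCL_MAJOR", "NCCL_MINOR", "NCCL_PATCH")
    parts = []
    for key in keys:
        # Accept both "KEY := 2" and "KEY = 2".
        match = re.search(rf"^{key}\s*:?=\s*(\d+)\s*$", text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"{key} not found in version text")
        parts.append(match.group(1))
    return ".".join(parts)


def pinned_version(requirement: str) -> str:
    """Extract the version from a pip requirement such as "nvidia-nccl-cu12==2.21.5"."""
    match = re.search(r"==\s*([0-9][0-9.]*)", requirement)
    if match is None:
        raise ValueError(f"no '==' pin in requirement: {requirement!r}")
    return match.group(1)


def versions_consistent(version_mk_text: str, requirement: str) -> bool:
    """True when the submodule version equals the PyPI pin."""
    return parse_version_mk(version_mk_text) == pinned_version(requirement)
```

Because a check like this compares one submodule version against the pin, every CUDA build ends up tied to the same NCCL release.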
We also build the latest NCCL from source here:
https://github.com/pytorch/pytorch/blob/main/.ci/docker/common/install_cuda.sh#L74
This prevents us from using different NCCL binaries for different CUDA builds. For instance, the latest NCCL as of Jan 14 is 2.24.3, but we are still using 2.21.5 since it is compatible with CUDA 11.8.
We would prefer to keep NCCL 2.21.5 for CUDA 11.8 builds, but move to a newer NCCL version for CUDA 12.4 and 12.6.
Hence the question: what is the NCCL submodule used for, and can we remove it and rely only on binaries?
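To make the proposal concrete, here is a rough sketch of per-CUDA pinning once the single-version constraint is gone. The table and function names are hypothetical, and using 2.24.3 for CUDA 12.4 and 12.6 is only an illustration of "a newer NCCL version", not a tested choice.

```python
# Hypothetical per-CUDA NCCL pins. Only the 11.8 -> 2.21.5 entry reflects
# what is used today; the 12.x entries are illustrative.
NCCL_PIN_BY_CUDA = {
    "11.8": "2.21.5",
    "12.4": "2.24.3",
    "12.6": "2.24.3",
}


def nccl_requirement(cuda_version: str) -> str:
    """Return the pip requirement string for the given CUDA version."""
    if cuda_version not in NCCL_PIN_BY_CUDA:
        raise KeyError(f"no NCCL pin defined for CUDA {cuda_version}")
    # PyPI wheels are published per CUDA major version (nvidia-nccl-cu11, nvidia-nccl-cu12).
    cuda_major = cuda_version.split(".")[0]
    return f"nvidia-nccl-cu{cuda_major}=={NCCL_PIN_BY_CUDA[cuda_version]}"


for cuda in ("11.8", "12.4", "12.6"):
    print(cuda, "->", nccl_requirement(cuda))
```

With a table like this, a consistency check against a single submodule version no longer makes sense, which is why the question of dropping the submodule comes up.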
cc @malfet @seemethere @ptrblck @msaroufim @eqy @albanD @kwen2501
Versions
2.7