
Is it possible to remove the NCCL submodule and use only the NCCL binaries from PyPI instead? #144768

@atalman

Description


🐛 Describe the bug

Currently we do both. We have the submodule:
https://github.com/pytorch/pytorch/tree/main/third_party/nccl

And we use the PyPI NCCL binaries:
https://github.com/pytorch/pytorch/blob/main/.github/scripts/generate_binary_build_matrix.py#L62

And we have code that checks whether the submodule version is consistent with the PyPI version, here:
https://github.com/pytorch/pytorch/blob/main/.github/scripts/generate_binary_build_matrix.py#L434
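
For reference, a minimal sketch of the kind of consistency check described above. The file path, the version-parsing logic, and the pinned value are assumptions for illustration, not the actual contents of generate_binary_build_matrix.py:

```python
# Hypothetical sketch: compare the NCCL version recorded in the submodule
# against the version pinned for the PyPI wheel. The path and the pin below
# are assumptions, not the real values from the build-matrix script.
import re
from pathlib import Path

PYPI_NCCL_PIN = "2.21.5"  # assumed pin; the real one lives in generate_binary_build_matrix.py

def submodule_nccl_version(repo_root: str = ".") -> str:
    """Read NCCL_{MAJOR,MINOR,PATCH} from the submodule's version.mk (assumed layout)."""
    text = Path(repo_root, "third_party/nccl/nccl/makefiles/version.mk").read_text()
    parts = [re.search(rf"NCCL_{key}\s*:=\s*(\d+)", text).group(1)
             for key in ("MAJOR", "MINOR", "PATCH")]
    return ".".join(parts)

if submodule_nccl_version() != PYPI_NCCL_PIN:
    raise RuntimeError("NCCL submodule version does not match the PyPI pin")
```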

We also build the latest NCCL from source here:
https://github.com/pytorch/pytorch/blob/main/.ci/docker/common/install_cuda.sh#L74

This prevents us from shipping different NCCL binaries for different CUDA builds. For instance, the latest NCCL as of Jan 14 is 2.24.3, but we are still using 2.21.5 since it is compatible with CUDA 11.8.

We would prefer to keep NCCL 2.21.5 for the CUDA 11.8 builds, but move to a newer NCCL version for the CUDA 12.4 and 12.6 builds; a rough sketch of that follows.
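
Very roughly, what we would like is a per-CUDA-version NCCL pin. The mapping, the helper, and the versions below are illustrative assumptions, not the current build-matrix code:

```python
# Illustrative sketch only: pin a different NCCL wheel per CUDA build, which the
# single shared submodule version currently makes impossible. Versions are examples.
NCCL_VERSION_FOR_CUDA = {
    "11.8": "2.21.5",  # keep the older NCCL that still works with CUDA 11.8
    "12.4": "2.24.3",  # hypothetical newer pin
    "12.6": "2.24.3",
}

def nccl_requirement(cuda_version: str) -> str:
    """Build the pip requirement string for the NCCL wheel matching a CUDA build."""
    cuda_major = cuda_version.split(".")[0]
    return f"nvidia-nccl-cu{cuda_major}=={NCCL_VERSION_FOR_CUDA[cuda_version]}"

# e.g. nccl_requirement("12.6") -> "nvidia-nccl-cu12==2.24.3"
```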

Hence the question: what is the NCCL submodule used for, and can we remove it and rely only on the binaries?

cc @malfet @seemethere @ptrblck @msaroufim @eqy @albanD @kwen2501

Versions

2.7

Labels

module: build (Build system issues)
module: cuda (Related to torch.cuda, and CUDA support in general)
module: nccl (Problems related to nccl support)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
