-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[BE]: Update NCCL submodule to 2.22.3 #133593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 2 jobs have failed, first few of them are: linux-binary-manywheel / manywheel-py3_8-cuda12_4-test / test, linux-binary-manywheel / manywheel-py3_8-cuda12_4-split-test / test Details for Dev Infra teamRaised by workflow job |
nWEIdia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The failure reminds me that such updates usually require S3 index updates (nvidia-nccl-cu11 and nvidia-nccl-cu12)
btw, Could this one wait for #132202 please?
We are targeting a Monday merge, praying for a zero conflict :)
18a29c0 to
bd355a2
Compare
bd355a2 to
9a4e880
Compare
| "nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | " | ||
| "nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | " | ||
| "nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | " | ||
| "nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here https://pypi.org/project/nvidia-nccl-cu11/ version: 2.21.5 is the latest one. I guess we only would need to update cu 12.1 and 12.4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eqy Do you know why there are no binaries released for 11.8 here? Is this intentional or a release bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If 2.22.3 is not supported for 11.8 and we need 2.22.3, we should stop supporting 11.8, but IMO this is a pretty big step to make... On the other hand, those constrains are only needed for poetry, aren't they? In that case, we might build/link with nccl statically for 11.8, but dynamically for 12.x
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We kinda decided to stop supporting 11.8, so maybe we could update nccl? cc @kwen2501
malfet
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not remove "nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | " constrains or just add statement, that we don't care about poetry users anymore...
That's not the issue, the issue is we have a divergence of supported NCCL versions. 11.8 only supports up to 2.21.5 while 12 supports newer versions. This will also be a problem for CUDNN especially as we update it to get needed CUDNN attention fixes. For NCCL it's a big problem since the CONDA 11.8 and the pypi builds willl diverge due to using different versions of submodule compiled code vs shipped binaries. |
|
Let's try to outline pros and cons of continuing to maintain 11.8 support in 2.5 (First PyTorch version to be released with 11.8 were PyTorch-2.0) |
|
Yeah, we could try to do some submodule hacking to change the tags in different dockers, but that seems like a pain. |
|
Let's deprecate CUDA 11.8 and update nccl at the same time, for 2.6 release probably. |
|
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as |
Update NCCL submodule to 2.22.3 . This is a pretty major release that has been out for a while and hasn't encountered any issues. It also fixes a lot bugs and enables some new NCCL features: https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-22-3.html . It's been out since June so it should be fairly stable.