
Conversation

@Skylion007
Collaborator

Update the NCCL submodule to 2.22.3. This is a fairly major release that has been out for a while without any reported issues. It also fixes a lot of bugs and enables some new NCCL features: https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-22-3.html. It has been out since June, so it should be fairly stable.

@Skylion007 Skylion007 requested review from a team and jeffdaily as code owners August 15, 2024 17:51
@pytorch-bot

pytorch-bot bot commented Aug 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133593

Note: Links to docs will display an error until the docs builds have been completed.

❌ 25 New Failures, 11 Unrelated Failures

As of commit 9a4e880 with merge base 215b145:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ezyang previously approved these changes Aug 16, 2024
@ezyang
Contributor

ezyang commented Aug 16, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 16, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: linux-binary-manywheel / manywheel-py3_8-cuda12_4-test / test, linux-binary-manywheel / manywheel-py3_8-cuda12_4-split-test / test

Details for Dev Infra team. Raised by workflow job.

Collaborator

@nWEIdia nWEIdia left a comment

The failure reminds me that such updates usually require S3 index updates (nvidia-nccl-cu11 and nvidia-nccl-cu12)

By the way, could this one wait for #132202, please?

We are targeting a Monday merge, praying for zero conflicts :)

@Skylion007 Skylion007 force-pushed the skylion007/update-nccl-2-22-3 branch from 18a29c0 to bd355a2 Compare August 17, 2024 15:52
@Skylion007 Skylion007 added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Aug 17, 2024
@Skylion007 Skylion007 force-pushed the skylion007/update-nccl-2-22-3 branch from bd355a2 to 9a4e880 Compare August 17, 2024 16:49
"nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
Contributor

Here, https://pypi.org/project/nvidia-nccl-cu11/, version 2.21.5 is the latest one. I guess we would only need to update cu12.1 and cu12.4.

Collaborator Author

@eqy Do you know why there are no binaries released for 11.8 here? Is this intentional or a release bug?

Collaborator Author

@atalman Unfortunately, this isn't easily possible, as Conda uses a statically linked NCCL that is built against the git submodule. We could build that one with 2.22.5, but it's untested and violates a testing assert. @malfet What are your thoughts going forward, now that 11.8 is no longer supported by NVIDIA?
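For illustration only, a hypothetical sketch of the kind of version-consistency assert that a mismatch could trip; the expected-version tuple here is an assumption, not the actual test:

```python
# Hypothetical sketch: a test that pins the NCCL version for the CUDA 11.8 build
# would fail once the submodule-built (statically linked) NCCL is newer than the
# nvidia-nccl-cu11 wheel pin.
import torch

EXPECTED_NCCL = (2, 21, 5)  # assumed pin for the CUDA 11.8 build in this sketch
built = torch.cuda.nccl.version()
assert built[:3] == EXPECTED_NCCL, f"NCCL mismatch: built {built}, expected {EXPECTED_NCCL}"
```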

Contributor

If 2.22.3 is not supported for 11.8 and we need 2.22.3, we should stop supporting 11.8, but IMO this is a pretty big step to take... On the other hand, those constraints are only needed for poetry, aren't they? In that case, we might build/link NCCL statically for 11.8, but dynamically for 12.x.
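A minimal sketch of that idea, with a hypothetical helper (the name is made up for illustration) that picks the NCCL linking mode from the CUDA version:

```python
# Hypothetical sketch of the proposal above: statically link the submodule-built
# NCCL for CUDA 11.x builds, and depend on the nvidia-nccl-cu12 wheel dynamically
# for CUDA 12.x builds.
def nccl_link_mode(cuda_version: str) -> str:
    major, minor = (int(x) for x in cuda_version.split(".")[:2])
    if major < 12:
        return "static"   # bundle NCCL built from the git submodule (e.g. 11.8)
    return "dynamic"      # link against the nvidia-nccl-cu12 wheel at runtime

assert nccl_link_mode("11.8") == "static"
assert nccl_link_mode("12.4") == "dynamic"
```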

Collaborator

We kinda decided to stop supporting 11.8, so maybe we could update nccl? cc @kwen2501

Contributor

@malfet malfet left a comment

Please do not remove the "nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | " constraint, or just add a statement that we don't care about poetry users anymore...

@Skylion007
Collaborator Author

Skylion007 commented Aug 20, 2024

Please do not remove the "nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | " constraint, or just add a statement that we don't care about poetry users anymore...

That's not the issue; the issue is that we have a divergence of supported NCCL versions: 11.8 only supports up to 2.21.5, while 12.x supports newer versions. This will also be a problem for cuDNN, especially as we update it to pick up needed cuDNN attention fixes.

For NCCL it's a big problem, since the Conda 11.8 build and the PyPI builds will diverge by using different NCCL versions: submodule-compiled code vs. shipped binaries.
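Concretely, with the versions mentioned in this thread, the divergence would look roughly like this sketch:

```python
# Versions taken from this thread: the cu11 wheel tops out at 2.21.5, while this
# PR moves the submodule to 2.22.3 (with the cu12 wheel pins expected to follow),
# so a statically linked CUDA 11.8 build and the CUDA 12.x wheels would ship
# different NCCL versions.
NCCL_PINS = {
    "cu118": "2.21.5",  # latest nvidia-nccl-cu11 wheel on PyPI
    "cu121": "2.22.3",  # submodule target of this PR
    "cu124": "2.22.3",
}

if len(set(NCCL_PINS.values())) > 1:
    print("NCCL versions diverge across CUDA builds:", NCCL_PINS)
```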

@ezyang ezyang self-requested a review August 20, 2024 15:14
@ezyang ezyang dismissed their stale review August 20, 2024 15:14

more discussion

@malfet
Contributor

malfet commented Aug 20, 2024

Let's try to outline the pros and cons of continuing to maintain 11.8 support in 2.5 (the first PyTorch version to be released with 11.8 was PyTorch 2.0).

@Skylion007
Collaborator Author

Yeah, we could try to do some submodule hacking to change the tags in the different Docker images, but that seems like a pain.

@soulitzer soulitzer added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Aug 20, 2024
@kit1980
Contributor

kit1980 commented Aug 30, 2024

Let's deprecate CUDA 11.8 and update NCCL at the same time, probably for the 2.6 release.

@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Oct 29, 2024
@github-actions github-actions bot closed this Nov 29, 2024

Labels

ciflow/binaries, ciflow/trunk, open source, Stale, topic: not user facing, triaged

10 participants