@d4l3k d4l3k commented Mar 21, 2025

Related to #149153

This updates the ci/docker build scripts to NCCL 2.26.2-1 to hopefully fix the nightly builds, which are somehow being built against NCCL 2.25.1 while using 2.26.2 from pip.

Test plan:

After merging, rerun the nightly Linux jobs and validate that the NCCL versions match.
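
One way to do the validation step is to normalize the build-time pin and the version reported at runtime, then compare the two. The following is a minimal sketch, not the check the nightly jobs actually run; the function names `normalize_nccl_version` and `nccl_versions_match` are made up for illustration.

```shell
# Hypothetical helper: reduce an NCCL version string such as
# "v2.26.2-1" or "2.26.2-1+cuda12.4" to the bare release "2.26.2".
# The body runs in a subshell so the scratch variable does not leak.
normalize_nccl_version() (
  v="${1#v}"      # drop a leading "v" (git tag style)
  v="${v%%-*}"    # drop the packaging suffix: "-1", "-1+cuda12.4", ...
  printf '%s\n' "$v"
)

# Hypothetical helper: succeeds (exit status 0) when both arguments
# refer to the same NCCL release, fails otherwise.
nccl_versions_match() {
  [ "$(normalize_nccl_version "$1")" = "$(normalize_nccl_version "$2")" ]
}
```

For example, `nccl_versions_match v2.26.2-1 2.26.2` succeeds, while `nccl_versions_match v2.25.1-1 2.26.2` fails. The runtime value could be taken from `torch.cuda.nccl.version()` in the built wheel, formatted as a dotted string.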

@d4l3k d4l3k requested review from Skylion007, atalman and malfet March 21, 2025 23:40
@d4l3k d4l3k requested a review from jeffdaily as a code owner March 21, 2025 23:40
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Mar 21, 2025
pytorch-bot bot commented Mar 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149778

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 32 Pending, 1 Unrelated Failure

As of commit e40be77 with merge base db9b031 (image):

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

set -ex

-NCCL_VERSION=v2.25.1-1
+NCCL_VERSION=v2.26.2-1

What about arm arch?

Yes, you are correct. This is still used. We will be consolidating these scripts: #149554
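
Until the scripts are consolidated, the same version has to be bumped in several places: the git tag form `v2.26.2-1`, the apt pin form `2.26.2-1+cuda12.4`, and the arm64 script. As a rough sketch of how one value could drive the others (an illustration only, not what #149554 does; `NCCL_BASE_VERSION`, `nccl_git_tag`, and `nccl_apt_pin` are hypothetical names):

```shell
# Hypothetical single source of truth for the NCCL version.
NCCL_BASE_VERSION="2.26.2-1"   # the only place to bump

# Git tag form used when building NCCL from source, e.g. v2.26.2-1.
nccl_git_tag() {
  printf 'v%s\n' "$1"
}

# apt pin for a given CUDA version, e.g.
# libnccl2=2.26.2-1+cuda12.4 libnccl-dev=2.26.2-1+cuda12.4
# $1 = NCCL base version, $2 = CUDA major.minor
nccl_apt_pin() {
  printf 'libnccl2=%s+cuda%s libnccl-dev=%s+cuda%s\n' "$1" "$2" "$1" "$2"
}
```

With this, `nccl_git_tag "$NCCL_BASE_VERSION"` and `nccl_apt_pin "$NCCL_BASE_VERSION" 12.4` produce the two strings changed in this PR, so a version bump touches one line instead of several.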

@Skylion007

Looks good once the corresponding arm64 script is also updated

maybe_libnccl_dev="libnccl2=2.15.5-1+cuda11.8 libnccl-dev=2.15.5-1+cuda11.8 --allow-downgrades --allow-change-held-packages"
elif [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "12.4"* ]]; then
-maybe_libnccl_dev="libnccl2=2.25.1-1+cuda12.4 libnccl-dev=2.25.1-1+cuda12.4 --allow-downgrades --allow-change-held-packages"
+maybe_libnccl_dev="libnccl2=2.26.2-1+cuda12.4 libnccl-dev=2.26.2-1+cuda12.4 --allow-downgrades --allow-change-held-packages"

@atalman atalman Mar 24, 2025

Contributor

I believe we can remove this if case as a followup since we don't have CUDA 12.4 in our CI anymore

@atalman atalman left a comment

lgtm

atalman commented Mar 24, 2025

@pytorchmergebot merge -f "lint and docker builds are green"

@pytorchmergebot

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

@d4l3k d4l3k deleted the d4l3k/nccl_build_2.25 branch March 24, 2025 16:47
atalman commented Mar 24, 2025

@pytorchbot cherry-pick --onto release/2.7 --fixes "nccl update" -c critical

pytorchbot pushed a commit that referenced this pull request Mar 24, 2025
Pull Request resolved: #149778
Approved by: https://github.com/Skylion007, https://github.com/atalman

Co-authored-by: Andrey Talman
(cherry picked from commit ddc0fe9)
@pytorchbot

Cherry picking #149778

The cherry pick PR is at #149874 and it is linked with issue nccl update. The following tracker issues are updated:

atalman added a commit to atalman/pytorch that referenced this pull request Mar 26, 2025
atalman pushed a commit that referenced this pull request Mar 26, 2025
ci/docker: use NCCL 2.26.2-1 (#149778)

Co-authored-by: Tristan Rice
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025