Enable CUDA 12.4.1 #132202
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/132202
Note: Links to docs will display an error until the docs builds have been completed.
❌ 7 New Failures, 4 Unrelated Failures as of commit 29e15e9 with merge base 8624a57:
NEW FAILURES - The following jobs have failed.
BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
What's the goal of updating the minor version? Are there important perf / bug fixes? Does it update the minimum driver version or requirements? If we are doing this, we should probably also consider updating NCCL and cuDNN in a concurrent PR.
The update originally comes from the suggestion of @malfet and @ptrblck to always use the latest minor release of CUDA 12.4, which is CUDA 12.4.1. The Linux minimum driver version (550.54.15) needed by CUDA 12.4.1 was already adopted when enabling CUDA 12.4.0 [yeah, we probably should have just enabled CUDA 12.4.1 then]. Only on the Windows side may the driver need a bump. NCCL and cuDNN should work with CUDA 12.4.1 much as they do with CUDA 12.4.0, though we could consider updating them.
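Not part of this PR, but for anyone reproducing the check locally, here is a minimal sketch (assuming `nvidia-smi` is on PATH) that compares the installed driver against the CUDA 12.4.1 minimums mentioned above and in the 12.4.1 release notes:

```python
# Minimal sketch: compare the local NVIDIA driver against the CUDA 12.4.1
# minimums (Linux per the comment above, Windows per Table 3 of the 12.4.1
# release notes). Assumes nvidia-smi is available on PATH.
import platform
import subprocess

MIN_DRIVER = {"Linux": (550, 54, 15), "Windows": (551, 78)}

def installed_driver_version() -> tuple[int, ...]:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    # Take the first GPU's driver version, e.g. "550.54.15" -> (550, 54, 15).
    return tuple(int(p) for p in out.strip().splitlines()[0].split("."))

required = MIN_DRIVER["Windows" if platform.system() == "Windows" else "Linux"]
found = installed_driver_version()
print(f"driver {found} {'meets' if found >= required else 'is below'} minimum {required}")
```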
Force-pushed from aaa456e to 46c2b60.
For Pip binaries/jobs:
For Conda binaries/jobs:
@atalman can help with the nightly packages.
@nWEIdia @Skylion007 Uploaded packages for 12.4 and the 12.4 split build.
https://developer.nvidia.com/cuda-toolkit-archive shows that CUDA 11.8 does not have patch releases. It looks like 12.1 indeed has an Update 1; fortunately, upstream PyTorch already uses 12.1.1, as indicated by https://github.com/pytorch/pytorch/blob/main/.github/scripts/generate_binary_build_matrix.py#L21
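For context, the mapping in that script looks roughly like the sketch below (illustrative, not a verbatim copy of the file); the `12.4` entry is what this PR bumps from 12.4.0 to 12.4.1:

```python
# Illustrative sketch of the kind of mapping kept in
# .github/scripts/generate_binary_build_matrix.py: short CUDA tags used in
# job names map to the full toolkit version installed in the build images.
CUDA_ARCHES = ["11.8", "12.1", "12.4"]

CUDA_ARCHES_FULL_VERSION = {
    "11.8": "11.8.0",   # 11.8 has no patch releases in the toolkit archive
    "12.1": "12.1.1",   # already on Update 1
    "12.4": "12.4.1",   # this PR: 12.4.0 -> 12.4.1
}

def full_cuda_version(arch: str) -> str:
    """Resolve a short CUDA tag (e.g. '12.4') to the full toolkit version."""
    return CUDA_ARCHES_FULL_VERSION[arch]
```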
Force-pushed from 46c2b60 to c18ac47.
Force-pushed from c18ac47 to 7753234.
atalman left a comment:
LGTM, please make sure the manywheel builds are green, revert the temp change, and merge.
This reverts commit 7753234.
Successfully rebased.
Force-pushed from 5f93c62 to 29e15e9.
For this kind of PR, it would be great if we could retain the test results from 653b4ac.
Update: 653b4ac is a rebased commit from https://hud.pytorch.org/pytorch/pytorch/commit/77532344e35f7bd8d6e4c221e1b6807e17734204
How to get to the old commit? Post a job link in this PR (e.g. https://github.com/pytorch/pytorch/actions/runs/10295868135/job/28500435981); it can serve as an index for finding the old commit ID.
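A small sketch of that "job link -> old commit" lookup, using the public GitHub REST API (the runs endpoint and its `head_sha` field are standard; unauthenticated access is assumed to be enough for a public repo):

```python
# Sketch: resolve a GitHub Actions job link to the commit SHA the run was
# triggered on, via GET /repos/{owner}/{repo}/actions/runs/{run_id}.
import json
import re
import urllib.request

def head_sha_from_job_link(job_link: str) -> str:
    """Extract the workflow run ID from an actions job URL and return the
    head commit SHA of that run."""
    run_id = re.search(r"/actions/runs/(\d+)", job_link).group(1)
    api = f"https://api.github.com/repos/pytorch/pytorch/actions/runs/{run_id}"
    with urllib.request.urlopen(api) as resp:
        return json.load(resp)["head_sha"]

print(head_sha_from_job_link(
    "https://github.com/pytorch/pytorch/actions/runs/10295868135/job/28500435981"
))
```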
@pytorchbot merge -i
Post merge:
Update: https://github.com/pytorch/pytorch/actions/workflows/build-manywheel-images.yml shows that the failure already exists on trunk: https://github.com/pytorch/pytorch/actions/runs/10460666895 and https://github.com/pytorch/pytorch/actions/runs/10463836139
Step 1: TL;DR, watch for simultaneous pytorch/test-infra AMI updates that could break things between the green signals and the final landing, e.g. the incident above.
2024-08-19T09:45:31.8316859Z #15 421.6 Installing : fipscheck-1.4.1-6.el7.x86_64 2/73
See the 7-minute gap!
Last known good build: https://github.com/pytorch/pytorch/actions/runs/10425348037/job/28876036438
Could the runner be from a different region? If not, the multiple `RUN sed -i s/^#.baseurl=http/baseurl=http/g /etc/yum.repos.d/.repo` occurrences seem suspicious.
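For reference, gaps like this can be found by scanning the raw log for large jumps between consecutive timestamps; a minimal sketch of that kind of scan (assuming the standard `2024-08-19T09:45:31.8316859Z`-style prefixes of downloaded GitHub Actions logs):

```python
# Sketch: report unusually large gaps between consecutive timestamped lines
# in a raw GitHub Actions log file.
import re
from datetime import datetime, timedelta

TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(\d+)Z")

def find_gaps(log_path: str, threshold: timedelta = timedelta(minutes=5)) -> None:
    prev_ts, prev_line = None, None
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = TS.match(line)
            if not m:
                continue
            # Truncate the fraction to microseconds so strptime accepts it.
            ts = datetime.strptime(
                f"{m.group(1)}.{m.group(2)[:6]}", "%Y-%m-%dT%H:%M:%S.%f"
            )
            if prev_ts is not None and ts - prev_ts > threshold:
                print(f"{ts - prev_ts} gap before: {line.strip()}")
                print(f"  previous line: {prev_line.strip()}")
            prev_ts, prev_line = ts, line
```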
@nWEIdia Looks like this may be related to the AMI update: the current build takes 5 hrs and uses:
cc @ZainRizvi
Yeah, this does seem related to the new Amazon Linux 2023 AMI upgrade. The green builds all use the Amazon Linux 2 AMI while the red ones are on Amazon Linux 2023. @atalman is working on a mitigation PR.
Post merge, Step 4:
The above aarch64 build failure is seen on CUDA aarch64. We will need to see whether this is due to the CUDA 12.4.1 upgrade and fix it for the 2.5.0 release. cc @atalman
Trying to keep a record of the steps before I lose track of them.
Update:
Post merge:
Step 1: TL;DR, watch for simultaneous pytorch/test-infra AMI updates that could break things between the green signals and the final landing. See #132202 (comment)
Step 2: Since this is a CUDA 12.4.0 -> CUDA 12.4.1 bump, on the Windows side we just need to check that the driver is bumped to CUDA 12.4.1's recommended driver version (see the sketch after this list).
https://github.com/pytorch/test-infra/blob/0c3a2634aaa2f638c8f640e743f03d696ce1191f/aws/ami/windows/scripts/Installers/Install-CUDA-Tools.ps1#L33
still shows 551.61; it needs to be bumped to 551.78 according to Table 3 of https://docs.nvidia.com/cuda/archive/12.4.1/cuda-toolkit-release-notes/index.html
Step 3: Trigger inductor performance testing, e.g. https://github.com/pytorch/pytorch/actions/runs/10527627251
cc @atalman @ptrblck @eqy @tinglvv @malfet
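For Step 2, a minimal sketch of the kind of check that could guard the Windows driver bump. The 12.4.0/12.4.1 driver numbers come from the comments above and Table 3 of the 12.4.1 release notes; the parsing of Install-CUDA-Tools.ps1 is hypothetical (the real script may store the version differently):

```python
# Sketch: verify the driver version hard-coded in the Windows AMI installer
# matches the recommended driver for the CUDA toolkit being enabled.
import re

# Recommended Windows driver per CUDA toolkit version (from the release notes
# and the comments above).
RECOMMENDED_WINDOWS_DRIVER = {
    "12.4.0": "551.61",
    "12.4.1": "551.78",
}

def driver_in_installer(ps1_text: str) -> str | None:
    # Hypothetical: look for a driver-version-looking token (e.g. 551.61)
    # in the Install-CUDA-Tools.ps1 contents.
    m = re.search(r"\b(5\d{2}\.\d{2})\b", ps1_text)
    return m.group(1) if m else None

def needs_bump(ps1_text: str, cuda_version: str) -> bool:
    return driver_in_installer(ps1_text) != RECOMMENDED_WINDOWS_DRIVER[cuda_version]
```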