
Conversation

@nWEIdia
Collaborator

@nWEIdia nWEIdia commented Jul 31, 2024

Trying to keep a record of the steps before I lose track of them.

update:
Post Merge:
Step 1: TL;DR, watch for simultaneous pytorch/test-infra AMI updates that could break things between the green signals and the final landing. See #132202 (comment)
Step 2: Since this is a CUDA 12.4.0 -> CUDA 12.4.1 bump, on the Windows side we just need to check that the driver is bumped to the version recommended for CUDA 12.4.1 (see the sketch after this list).
https://github.com/pytorch/test-infra/blob/0c3a2634aaa2f638c8f640e743f03d696ce1191f/aws/ami/windows/scripts/Installers/Install-CUDA-Tools.ps1#L33
still shows 551.61; it needs to be bumped to 551.78 according to Table 3 of https://docs.nvidia.com/cuda/archive/12.4.1/cuda-toolkit-release-notes/index.html
Step 3: Trigger inductor performance testing, e.g. https://github.com/pytorch/pytorch/actions/runs/10527627251
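
As a quick aid for Step 2, here is a minimal sketch (not part of this PR; the nvidia-smi query flags are standard, but the threshold table is just the versions quoted above):

```python
# Minimal sketch: check the installed NVIDIA driver against the minimum
# recommended for CUDA 12.4.1 (551.78 on Windows, 550.54.15 on Linux, per
# Table 3 of the CUDA 12.4.1 release notes linked above).
import platform
import subprocess

MIN_DRIVER = {"Windows": (551, 78), "Linux": (550, 54, 15)}

def installed_driver_version() -> tuple[int, ...]:
    # nvidia-smi prints one driver version per GPU; take the first one.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    return tuple(int(x) for x in out.strip().splitlines()[0].split("."))

required = MIN_DRIVER[platform.system()]
actual = installed_driver_version()
print(f"driver {actual} meets minimum {required}: {actual >= required}")
```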

cc @atalman @ptrblck @eqy @tinglvv @malfet

@nWEIdia nWEIdia requested a review from jeffdaily as a code owner July 31, 2024 01:37
@pytorch-bot

pytorch-bot bot commented Jul 31, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/132202

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 4 Unrelated Failures

As of commit 29e15e9 with merge base 8624a57:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jul 31, 2024
@nWEIdia nWEIdia added ciflow/inductor ciflow/inductor-cu124 ciflow/inductor-micro-benchmark ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR and removed ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/inductor ciflow/inductor-micro-benchmark ciflow/inductor-cu124 labels Jul 31, 2024
@nWEIdia nWEIdia requested a review from a team as a code owner July 31, 2024 09:04
@Skylion007
Collaborator

Skylion007 commented Jul 31, 2024

What's the goal of updating the minor version? Are there important perf / bug fixes? Does it update the minimum driver version or requirements? If we are doing this, we should probably also consider updating NCCL and cuDNN in a concurrent PR.

@nWEIdia
Collaborator Author

nWEIdia commented Jul 31, 2024

What's the goal of updating the minor version? Are there important perf / bug fixes? Does it update the minimum driver version or requirements? If we are doing this, we should probably also consider updating NCCL and cuDNN in a concurrent PR.

The update originally comes from the suggestion of @malfet and @ptrblck to always use the latest update of CUDA 12.4, which is CUDA 12.4.1.
At the very least it would fix the nanogpt smoke-test regression, see https://github.com/pytorch/pytorch/blob/main/.ci/pytorch/test.sh#L640

The Linux minimum driver version (550.54.15) needed by CUDA 12.4.1 was already adopted when enabling CUDA 12.4.0 [yeah, we probably should have just enabled CUDA 12.4.1 back then]. Only on the Windows side may the driver need a bump.

NCCL and cuDNN should work with CUDA 12.4.1 much as they do with CUDA 12.4.0, though we could consider updating them.
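
To confirm what a given build actually ships, here is a minimal sketch (assuming a CUDA-enabled torch wheel is installed) that prints the bundled CUDA, cuDNN, and NCCL versions:

```python
# Minimal sketch: report the CUDA / cuDNN / NCCL versions a torch build was
# compiled against, useful for comparing 12.4.0- vs 12.4.1-based wheels.
import torch

print("CUDA :", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
# NCCL is only bundled in the Linux builds.
if torch.distributed.is_available() and torch.distributed.is_nccl_available():
    print("NCCL :", torch.cuda.nccl.version())
```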

@colesbury colesbury added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 31, 2024
@nWEIdia nWEIdia force-pushed the enable_cuda_12.4.1 branch from aaa456e to 46c2b60 Compare July 31, 2024 21:52
@nWEIdia nWEIdia added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Jul 31, 2024
@nWEIdia
Collaborator Author

nWEIdia commented Aug 1, 2024

For the pip binary jobs:
https://ossci-raw-job-status.s3.amazonaws.com/log/28186744872 shows the error below:
pip install /final_pkgs/torch-2.5.0.dev20240731+cu124-cp310-cp310-linux_x86_64.whl --index-url https://download.pytorch.org/whl/nightly/cu124
No matching distribution found for nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == "Linux" and platform_machine == "x86_64"
This says the PyPI dependency packages have to be copied to and hosted on the PyTorch nightly index (https://download.pytorch.org/whl/nightly/cu124).
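
A minimal sketch for checking this kind of failure (assuming the nightly index exposes a PEP 503-style simple page per package; the URL below follows that layout): verify whether the pinned nvrtc version is actually listed on the cu124 index.

```python
# Minimal sketch: check whether the pinned nvidia-cuda-nvrtc-cu12 version is
# present on the cu124 nightly index that the torch wheel points at.
import urllib.request

INDEX_PAGE = "https://download.pytorch.org/whl/nightly/cu124/nvidia-cuda-nvrtc-cu12/"
PINNED = "12.4.127"

with urllib.request.urlopen(INDEX_PAGE) as resp:
    html = resp.read().decode("utf-8", errors="replace")

print(f"{PINNED} listed on index: {PINNED in html}")
```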

For the conda binary jobs:
Linking: pytorch/builder#1947

@Skylion007 Skylion007 requested a review from atalman August 1, 2024 19:08
@Skylion007
Collaborator

@atalman can help with the nightly packages.

@Skylion007
Collaborator

Also pinging @malfet. @nWEIdia Are there any older versions of CUDA we should do minor version bumps for? Like 11.8 or 12.1?

@atalman
Contributor

atalman commented Aug 5, 2024

@nWEIdia @Skylion007 I uploaded packages for 12.4 and the 12.4 split build.

@nWEIdia
Collaborator Author

nWEIdia commented Aug 5, 2024

Also pinging @malfet. @nWEIdia Are there any older versions of CUDA we should do minor version bumps for? Like 11.8 or 12.1?

https://developer.nvidia.com/cuda-toolkit-archive shows that CUDA 11.8 does not have patch versions. 12.1 does indeed have an Update 1, but fortunately upstream PyTorch already builds with 12.1.1, as indicated by https://github.com/pytorch/pytorch/blob/main/.github/scripts/generate_binary_build_matrix.py#L21
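
For illustration only, a sketch of the kind of full-version mapping that script maintains (the variable name and values here are assumptions for this sketch, not copied from the file):

```python
# Illustrative sketch: map each CUDA "arch" to the concrete toolkit patch
# release used for binary builds. Names/values are assumptions, not the
# actual contents of generate_binary_build_matrix.py.
CUDA_FULL_VERSION = {
    "11.8": "11.8.0",  # 11.8 has no patch releases
    "12.1": "12.1.1",  # Update 1 already used upstream
    "12.4": "12.4.1",  # this PR's bump from 12.4.0
}

print(CUDA_FULL_VERSION["12.4"])  # "12.4.1"
```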

@nWEIdia nWEIdia force-pushed the enable_cuda_12.4.1 branch from 46c2b60 to c18ac47 Compare August 5, 2024 21:29
@nWEIdia nWEIdia force-pushed the enable_cuda_12.4.1 branch from c18ac47 to 7753234 Compare August 6, 2024 19:47
Contributor

@atalman atalman left a comment

LGTM, please make sure the manywheel builds are green, revert the temp change, and merge.

@pytorchmergebot
Collaborator

Successfully rebased enable_cuda_12.4.1 onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout enable_cuda_12.4.1 && git pull --rebase)

@nWEIdia
Collaborator Author

nWEIdia commented Aug 15, 2024

Looks like the signal is green for the CUDA tests, but the 6 pending jobs are due to a long queue for the linux.9xlarge.ephemeral runner.

cc @atalman @malfet

@nWEIdia
Collaborator Author

nWEIdia commented Aug 15, 2024

For this kind of PR, it would be great if we could retain the test results from 653b4ac.

Update: 653b4ac is a rebase of https://hud.pytorch.org/pytorch/pytorch/commit/77532344e35f7bd8d6e4c221e1b6807e17734204
Fortunately, the pre-rebase commit and its test results are still available!

How to get to the old commit? A job link posted in this PR (e.g. https://github.com/pytorch/pytorch/actions/runs/10295868135/job/28500435981) can serve as an index for finding the old commit ID (see the sketch below).
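
A minimal sketch of that lookup (using the public GitHub Actions REST API; the run ID comes from the example job link above):

```python
# Minimal sketch: given a workflow run ID from a job link posted in the PR,
# look up the commit (head SHA) that run was triggered on.
import json
import urllib.request

RUN_ID = 10295868135  # from the example job URL above

url = f"https://api.github.com/repos/pytorch/pytorch/actions/runs/{RUN_ID}"
with urllib.request.urlopen(url) as resp:
    run = json.load(resp)

print(run["head_sha"])  # the pre-rebase commit the old test results belong to
```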

@nWEIdia
Collaborator Author

nWEIdia commented Aug 20, 2024

@pytorchbot merge -i

@nWEIdia
Collaborator Author

nWEIdia commented Aug 21, 2024

1st bad build: https://github.com/pytorch/pytorch/actions/runs/10433835042/job/28936166917

2024-08-19T09:45:31.8316859Z #15 421.6 Installing : fipscheck-1.4.1-6.el7.x86_64 2/73
2024-08-19T09:52:25.7608721Z #15 421.7 Installing : fipscheck-lib-1.4.1-6.el7.x86_64 3/73

Note the ~7-minute gap between these two consecutive package installs! (A quick way to compute it is sketched below.)

Last known good build: https://github.com/pytorch/pytorch/actions/runs/10425348037/job/28876036438
There, the installation of these two packages is instant!

Could the runner be from a different region? If not, the multiple `RUN sed -i 's/^#.*baseurl=http/baseurl=http/g' /etc/yum.repos.d/*.repo` lines seem suspicious.
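
A minimal sketch of the gap computation (timestamps copied from the log excerpt above):

```python
# Minimal sketch: compute the gap between the two consecutive yum install
# lines quoted above. The log timestamps carry 7 fractional digits plus a
# trailing 'Z', so trim to microseconds before parsing.
from datetime import datetime

def parse(ts: str) -> datetime:
    return datetime.strptime(ts[:26], "%Y-%m-%dT%H:%M:%S.%f")

t_fipscheck = parse("2024-08-19T09:45:31.8316859Z")
t_fipscheck_lib = parse("2024-08-19T09:52:25.7608721Z")
print(t_fipscheck_lib - t_fipscheck)  # ~6 min 54 s for a single package install
```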

cc @atalman @malfet @ptrblck

@atalman
Contributor

atalman commented Aug 21, 2024

@nWEIdia Looks like this may be related to an AMI update:
Last successful build: https://github.com/pytorch/pytorch/actions/runs/10424302444/job/28872834620 uses AMI Name: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

The current build takes 5 hrs and uses:
https://github.com/pytorch/pytorch/actions/runs/10463836139/job/28976474949
AMI Name: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

cc: @ZainRizvi

@ZainRizvi
Contributor

Yeah, this does seem related to the new Amazon Linux 2023 AMI upgrade. The green builds all use the Amazon Linux 2 AMI, while the red ones are on Amazon Linux 2023. @atalman is working on a mitigation PR.

@nWEIdia
Collaborator Author

nWEIdia commented Sep 10, 2024

Post-merge Step 4:
Missed step: aarch64 binary validation (the build was OK, but validation of the binary in the builder repo was not):

https://github.com/pytorch/builder/actions/runs/10776424717/job/29883178793#step:12:4025

@tinglvv
Collaborator

tinglvv commented Sep 10, 2024

The aarch64 build failure above is seen on CUDA aarch64.
Error: https://github.com/pytorch/builder/actions/runs/10776424717/job/29883178793#step:12:4025

++ python3 ./test/smoke_test/smoke_test.py --package torchonly
Traceback (most recent call last):
  File "/pytorch/builder/./test/smoke_test/smoke_test.py", line 9, in <module>
    import torch._dynamo
  File "/opt/conda/envs/conda-env-10794919545/lib/python3.9/site-packages/torch/_dynamo/__init__.py", line 3, in <module>
    from . import convert_frame, eval_frame, resume_execution
  File "/opt/conda/envs/conda-env-10794919545/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 53, in <module>
    from . import config, exc, trace_rules
  File "/opt/conda/envs/conda-env-10794919545/lib/python3.9/site-packages/torch/_dynamo/trace_rules.py", line 45, in <module>
    from .utils import getfile, hashable, NP_SUPPORTED_MODULES, unwrap_if_wrapper
ImportError: cannot import name 'NP_SUPPORTED_MODULES' from 'torch._dynamo.utils' (/opt/conda/envs/conda-env-10794919545/lib/python3.9/site-packages/torch/_dynamo/utils.py)

Will need to see if this is due to the CUDA 12.4.1 upgrade and fix it for the 2.5.0 release.
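
A minimal sketch of a quick follow-up check (not the builder repo's smoke test; whether this separates a broken wheel from a CUDA 12.4.1-specific problem is an assumption): import torch._dynamo directly in the same environment.

```python
# Minimal sketch: check whether the installed nightly wheel itself can import
# torch._dynamo, independent of smoke_test.py.
import torch

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
import torch._dynamo  # reproduces the ImportError above if the wheel is broken
print("torch._dynamo import OK")
```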

cc @atalman
