[CUDA] [CI]: Enable CUDA 12.4 CI #121956

nWEIdia · 2024-03-15T06:48:06Z

Reference PR: #93406

cc @atalman @malfet @ptrblck @eqy

pytorch-bot · 2024-03-15T06:48:09Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121956

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit a4a5d05 with merge base 5ea956a:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

inductor / cuda12.1-py3.12-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_bmm_multiple_dynamic_abi_compatible_cuda

This comment was automatically generated by Dr. CI and updates every 15 minutes.

malfet

Hmm, I think this needs some discussion, because if we are to stop building/testing CUDA-11.8, it means we are losing Keplers, doesn't it?
@atalman is there a doc on whether this is intended?

johnnynunez · 2024-03-17T18:38:06Z

Hmm, I think this needs some discussion, because if we are to stop building/testing CUDA-11.8, it means we are losing Keplers, doesn't it? @atalman is there a doc on whether this is intended?

Why not maintain 11.8 as CUDA 11 and 12.4 as CUDA 12, and skip 12.1? I mean always maintain two versions of CUDA;
for example, if CUDA 13 comes out, the supported versions would be 12 and 13.

nWEIdia · 2024-03-18T06:58:10Z

We discussed this: for the short term, we would have 11.8, 12.1, and 12.4. I will need to refactor this PR to add back 11.8.

johnnynunez · 2024-03-23T23:31:59Z

When will this be merged? 😋

nWEIdia · 2024-03-24T06:00:38Z

12.4 workflows are failing. Still working on coming up with a fix.

johnnynunez · 2024-04-04T09:08:09Z

12.4 update 1 is out:
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/

johnnynunez · 2024-04-09T20:59:39Z

@ptrblck @nWEIdia NVIDIA cuDNN 9 is now available: nvidia-cudnn-cu12 9.0.0.312
https://pypi.org/project/nvidia-cudnn-cu12/

ptrblck · 2024-04-09T21:19:55Z

@johnnynunez Yes, it is! We will focus on 12.4 in this PR and follow up with the cuDNN update separately to avoid creating confusing issues pointing to the CUDA and cuDNN update.

pytorchmergebot · 2024-05-01T00:26:25Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-05-01T00:26:31Z

Successfully rebased cuda_124_ci onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cuda_124_ci && git pull --rebase)

pytorch-bot · 2024-05-10T00:18:51Z

Warning: Unknown label ciflow/inductor-perf-test-nightly.
Currently recognized labels are

ciflow/binaries
ciflow/binaries_conda
ciflow/binaries_libtorch
ciflow/binaries_wheel
ciflow/inductor
ciflow/inductor-perf-compare
ciflow/inductor-micro-benchmark
ciflow/linux-aarch64
ciflow/mps
ciflow/nightly
ciflow/periodic
ciflow/rocm
ciflow/slow
ciflow/trunk
ciflow/unstable
ciflow/xpu
ciflow/torchbench

Please add the new label to .github/pytorch-probot.yml
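For illustration, the check behind this warning can be sketched as a set lookup. This is a minimal sketch, not the bot's actual code: the function name `unknown_ciflow_labels` is hypothetical, the recognized set is copied from the warning above, and `pr_labels` is an assumed input built from the ciflow labels applied to this PR on May 10, 2024.

```python
# Recognized labels, copied verbatim from the bot warning above.
RECOGNIZED_CIFLOW_LABELS = {
    "ciflow/binaries",
    "ciflow/binaries_conda",
    "ciflow/binaries_libtorch",
    "ciflow/binaries_wheel",
    "ciflow/inductor",
    "ciflow/inductor-perf-compare",
    "ciflow/inductor-micro-benchmark",
    "ciflow/linux-aarch64",
    "ciflow/mps",
    "ciflow/nightly",
    "ciflow/periodic",
    "ciflow/rocm",
    "ciflow/slow",
    "ciflow/trunk",
    "ciflow/unstable",
    "ciflow/xpu",
    "ciflow/torchbench",
}


def unknown_ciflow_labels(labels):
    """Return the ciflow/* labels that are not in the recognized set, sorted.

    Hypothetical helper: labels outside the ciflow/ namespace are ignored.
    """
    return sorted(
        label
        for label in labels
        if label.startswith("ciflow/") and label not in RECOGNIZED_CIFLOW_LABELS
    )


# Assumed input: the ciflow labels added to this PR in the same batch.
pr_labels = [
    "ciflow/trunk",
    "ciflow/periodic",
    "ciflow/inductor",
    "ciflow/inductor-perf-test-nightly",
    "ciflow/slow",
]

print(unknown_ciflow_labels(pr_labels))
# prints ['ciflow/inductor-perf-test-nightly']
```

Only the nightly perf-test label is flagged, which matches the warning.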

pytorchmergebot · 2024-05-19T16:12:05Z

Successfully rebased cuda_124_ci onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cuda_124_ci && git pull --rebase)

Fixes issues encountered in pytorch#121956 Pull Request resolved: pytorch#125944 Approved by: https://github.com/atalman

nWEIdia · 2024-05-20T18:54:09Z

@malfet Could you please help take another look?
I am composing torchinductor 12.4 issues in here.
Thanks!

atalman · 2024-05-21T17:40:47Z

Hi @nWEIdia, please disable the failing tests. We will follow up on this in the issue you opened.

atalman · 2024-05-23T20:35:59Z

@pytorchmergebot merge -f "All required tests are pasing"

pytorchmergebot · 2024-05-23T20:37:34Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

huydhn · 2024-05-24T17:38:13Z

For context, after this change landed in trunk, the new CUDA 12.4 build started to fail on newly created open PyTorch PRs. Here is what happens:

This PR includes the fix for the build failure in caffe2/CMakeLists.txt.
After this lands in trunk, newer PRs start running the CUDA 12.4 build from the trunk version of the workflow. We now have 2 cases:
1. If a newer PR has https://hud.pytorch.org/pytorch/pytorch/commit/0902929d582879caa926705f283f5f55864bd7bf, it builds fine because it has the fix.
2. If a newer PR doesn't have the above commit, it fails without the fix, for example Default TreadPool size to number of physical cores #125963.

This is not an ideal rollout, but the way forward for now is to ask folks to rebase onto main.
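The two cases above come down to whether the fix commit is an ancestor of the PR branch. A minimal sketch of that check, assuming it is run inside a PyTorch checkout: `contains_commit` and `advice` are hypothetical helper names, and the `run` parameter exists only so the command runner can be swapped out. `git merge-base --is-ancestor` is a standard git command that exits 0 when the first commit is an ancestor of the second and 1 when it is not.

```python
import subprocess

# Commit that landed this PR, taken from the HUD link above.
FIX_COMMIT = "0902929d582879caa926705f283f5f55864bd7bf"


def contains_commit(commit, ref="HEAD", run=subprocess.run):
    """Return True if `commit` is an ancestor of `ref` in the current repo."""
    result = run(
        ["git", "merge-base", "--is-ancestor", commit, ref],
        capture_output=True,
    )
    if result.returncode == 0:
        return True  # case 1: the branch already has the fix
    if result.returncode == 1:
        return False  # case 2: the branch predates the fix
    # Any other exit code means git itself failed (unknown commit, not a repo).
    raise RuntimeError(result.stderr.decode(errors="replace"))


def advice(has_fix):
    """Map the ancestry result to the action suggested in the comment above."""
    if has_fix:
        return "has the fix: the CUDA 12.4 build should work"
    return "missing the fix: rebase onto main"


# Usage, from inside a PyTorch checkout with the PR branch checked out:
#   print(advice(contains_commit(FIX_COMMIT)))
```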

nWEIdia · 2024-05-24T18:03:42Z

Sorry for the mishaps.

The PR went in on 05/23 at 1:37pm. @malfet issued a "pytorch rebase" at 2:22pm on the #125963 PR, but the result was based on #126976 (10:31am), which predates the fix.
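The timeline can be written out as a small worked example. This is a sketch under stated assumptions: all three times are taken to be on 2024-05-23 in the same timezone, and commit order is approximated by wall-clock time, whereas the real criterion is git ancestry. The helper name `rebase_picks_up_fix` is hypothetical.

```python
from datetime import datetime

# Assumption: all three events happened on 2024-05-23 in the same timezone.
fix_merged = datetime(2024, 5, 23, 13, 37)         # this PR landed
viable_strict_tip = datetime(2024, 5, 23, 10, 31)  # #126976, the rebase base
rebase_issued = datetime(2024, 5, 23, 14, 22)      # rebase on #125963


def rebase_picks_up_fix(fix_time, base_tip_time):
    """A rebase includes the fix only if its base is at least as new as the fix."""
    return base_tip_time >= fix_time


print(rebase_picks_up_fix(fix_merged, viable_strict_tip))  # prints False
print(fix_merged - viable_strict_tip)                      # prints 3:06:00
print(rebase_issued - fix_merged)                          # prints 0:45:00
```

The rebase ran 45 minutes after the fix merged, but its base was about three hours older than the fix, so the rebased branch still lacked it.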

I guess the lesson is we should request an immediate push to viable/strict for future occurrences like this.

@clee2000

Discovered by @clee2000. The change was introduced in #121956 Pull Request resolved: #127121 Approved by: https://github.com/clee2000, https://github.com/Skylion007

Reference PR: pytorch#93406 Co-authored-by: Aidyn-A <[email protected]> Pull Request resolved: pytorch#121956 Approved by: https://github.com/atalman

nWEIdia requested a review from a team as a code owner March 15, 2024 06:48

pytorch-bot bot added the topic: not user facing topic category label Mar 15, 2024

pytorchbot added the open source label Mar 15, 2024

janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Mar 15, 2024

malfet requested changes Mar 15, 2024

nWEIdia changed the title ~~CUDA CI changes: 11.8->12.1, 12.1->12.4~~ Draft: CUDA CI changes: 11.8->12.1, 12.1->12.4 Mar 18, 2024

nWEIdia changed the title ~~Draft: CUDA CI changes: 11.8->12.1, 12.1->12.4~~ CUDA CI changes: Add CUDA 12.4 CI Mar 22, 2024

nWEIdia force-pushed the cuda_124_ci branch from d60dcfc to 3678581 Compare April 10, 2024 17:38

pytorchmergebot force-pushed the cuda_124_ci branch from 3678581 to 53df6a6 Compare May 1, 2024 00:26

nWEIdia force-pushed the cuda_124_ci branch from 53df6a6 to c2b3055 Compare May 2, 2024 07:44

nWEIdia force-pushed the cuda_124_ci branch from c2b3055 to d6b4f77 Compare May 9, 2024 23:45

nWEIdia requested a review from jeffdaily as a code owner May 10, 2024 00:16

nWEIdia added ciflow/trunk Trigger trunk jobs on your pull request ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/inductor ciflow/inductor-perf-test-nightly Trigger nightly inductor perf tests ciflow/slow labels May 10, 2024

nWEIdia changed the title ~~CUDA CI changes: Add CUDA 12.4 CI~~ [CUDA] [CI]: Enable CUDA 12.4 CI May 10, 2024

nWEIdia added the ciflow/inductor-perf-compare label May 10, 2024

Aidyn-A and others added 3 commits May 19, 2024 16:12

suppress deprecation cusparse warnings v2

ec87179

Add missing inductor py3.12 cu12.4 build job

1bab2d1

suppress deprecation cusparse warnings v3: Linux only

c8c7ddf

pytorchmergebot force-pushed the cuda_124_ci branch from bd6bbf3 to c8c7ddf Compare May 19, 2024 16:12

nWEIdia mentioned this pull request May 20, 2024

CUDA 12.4 CI Inductor Issues #126692

Closed

nWEIdia added 3 commits May 21, 2024 13:27

Undo inductor-perf-test-nightly.yml changes as those were not tested by PR. Require manual testing. Planning to do it via a separate PR.

0b794c6

Undo slow changes as well.

426d074

Disabling test shards given pytorch#126692

a4a5d05

Fuzzkatt mentioned this pull request May 21, 2024

[DO NOT MERGE] Fuzzkatt/cuda 124 ci debug #126825

Closed

atalman approved these changes May 23, 2024

pytorchmergebot added the merging label May 23, 2024

pytorchmergebot closed this in 0902929 May 23, 2024

pytorchmergebot added Merged and removed merging labels May 23, 2024

atalman mentioned this pull request May 24, 2024

UNSTABLE pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / build #127104

Closed

atalman mentioned this pull request May 24, 2024

UNSTABLE pull / linux-focal-cuda12.4-py3.10-gcc9 / build #127108

Closed

huydhn mentioned this pull request May 24, 2024

Fix typo in inductor workflow for CUDA 12.4 jobs #127121

Closed

tinglvv mentioned this pull request Nov 15, 2024

Enable CUDA 12.6 OSS CI #140793

Closed
