Conversation

@Skylion007 (Collaborator)

Update to CUDNN 9.10.2.21

@Skylion007 Skylion007 requested review from a team and jeffdaily as code owners June 10, 2025 17:22

pytorch-bot bot commented Jun 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155576

Note: Links to docs will display an error until the doc builds have completed.

❌ 8 New Failures, 3 Unrelated Failures

As of commit 8315757 with merge base 9328a7f:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the `topic: not user facing` label Jun 10, 2025
@nWEIdia (Collaborator) commented Jun 10, 2025

cc @atalman for help uploading the cudnn packages.

@Skylion007 Skylion007 force-pushed the skylion007/update-cudnn-9-10-2-21 branch from 6634853 to 2e411ca on June 10, 2025 18:57
@nWEIdia (Collaborator) commented Jun 10, 2025

I wouldn't bump the cuDNN version for cu126, as we haven't done enough testing of the cu126 + 9.10.2.21 combination.
I recommend keeping the cuDNN version unchanged for CUDA 12.6.
But I'm open to what others prefer.

@nWEIdia (Collaborator) commented Jun 10, 2025

The other aspect to this (CUDA 12.6 + 9.10.2.21) is that, as time goes by, CUDA 12.6 is getting tested less in CI due to the ongoing effort to move CI from CUDA 12.6 to 12.8.

@Skylion007 (Collaborator, Author)

There are important performance updates for SDPA on A100s/H100s here, though, and it's better to support fewer cuDNN versions.

@nWEIdia (Collaborator) commented Jun 10, 2025

Aligned with #154980. cc @tinglvv

@Skylion007 Skylion007 added the `ciflow/binaries_wheel` label (trigger binary build and upload jobs for wheel on the PR) Jun 11, 2025
@Skylion007 (Collaborator, Author)

Todo fix


@Skylion007 Skylion007 force-pushed the skylion007/update-cudnn-9-10-2-21 branch from 2e411ca to 8315757 on June 11, 2025 15:07
@Skylion007 (Collaborator, Author)

@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the `ciflow/trunk` label (trigger trunk jobs on your pull request) Jun 11, 2025
@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@tinglvv (Collaborator) commented Jun 11, 2025

Hi @Skylion007, it seems 12.8 was missed in this PR's .ci/docker/common/install_cuda.sh. Please follow up with a fix to update it, thanks.
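
(For context, a hypothetical way such a miss could be caught: scan install_cuda.sh for four-part cuDNN version strings and flag disagreement between CUDA sections. The path comes from the comment above; the regex is only an assumption about how versions appear in that file, not the actual fix that landed.)

```python
# Hypothetical consistency check: flag cuDNN version strings in
# install_cuda.sh that disagree across CUDA sections. The pattern below
# is an assumption about the file's contents, not the real fix.
import re
from pathlib import Path

script = Path(".ci/docker/common/install_cuda.sh").read_text()
# Collect anything that looks like a four-part cuDNN version, e.g. 9.10.2.21.
versions = set(re.findall(r"cudnn[-_][\w.-]*?(\d+\.\d+\.\d+\.\d+)", script))
if len(versions) > 1:
    print(f"cuDNN versions disagree across sections: {sorted(versions)}")
```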

@malfet (Contributor) commented Jun 11, 2025

@pytorchbot revert -m "breaks the same test again (I remember there were a version that adjusted tolerances), see https://hud.pytorch.org/hud/pytorch/pytorch/bc3972b80a7abe85036f48b610532fce39ea5097/1?per_page=50&name_filter=gcc11-sm89&mergeEphemeralLF=true" -c nosignal

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator)

@Skylion007 your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added the `Reverted` and `ci-no-td` (do not run TD on this PR) labels Jun 11, 2025
@atalman (Contributor) commented Jun 11, 2025

I tuned the test here, so landing this PR should fix it: #155234
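
(For readers unfamiliar with what "tuning the test" involves: below is a minimal sketch of loosening numeric tolerances in a PyTorch-style check, assuming an SDPA result compared against a higher-precision reference. The shapes and tolerance values are illustrative assumptions, not the actual change in #155234.)

```python
# Illustrative sketch only; not the real test changed in #155234.
import torch
import torch.nn.functional as F

def check_sdpa_tolerance():
    if not torch.cuda.is_available():
        return  # the real test runs on CUDA CI runners
    # Hypothetical shapes: (batch, heads, seq_len, head_dim).
    q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
               for _ in range(3))
    out = F.scaled_dot_product_attention(q, k, v)
    # Reference computed in float64 to isolate the kernel's numerics.
    ref = F.scaled_dot_product_attention(q.double(), k.double(), v.double())
    # "Tuning the test" usually means widening atol/rtol so a numerically
    # valid kernel change (e.g. a new cuDNN version) no longer fails it.
    torch.testing.assert_close(out, ref.half(), atol=3e-3, rtol=2e-3)
```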

@atalman (Contributor) commented Jun 12, 2025

@pytorchmergebot merge -f "fix for the failure is deployed #155234"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@Skylion007 (Collaborator, Author)

Apologies here, I should have double-checked this before @atalman remerged the PR. Interestingly, this should have raised a warning from the cuDNN frontend logger, but I'm surprised nobody reported it. I guess that's because the nightlies are often statically linked?
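
(A quick way to check which cuDNN a given build actually reports at runtime, whether linked statically or dynamically. These are standard torch APIs; the integer encoding shown in the comment is my reading of cuDNN's versioning scheme.)

```python
# Runtime check of the cuDNN version a PyTorch build links against.
import torch

print(torch.version.cuda)                 # CUDA toolkit the wheel was built with
print(torch.backends.cudnn.version())    # e.g. 91002 should correspond to 9.10.2
print(torch.backends.cudnn.is_available())
```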
