Conversation

@eqy
Collaborator

@eqy eqy commented Oct 29, 2025

Repros without the need for specific tensor data.
Should be passing with cuDNN frontend 1.15.0, which current main has.

cc @csarofeen @ptrblck @xwang233

@eqy eqy added the module: cudnn (Related to torch.backends.cudnn, and cuDNN support), open source, topic: not user facing (topic category), and module: sdpa (All things related to torch.nn.functional.scaled_dot_product_attention) labels on Oct 29, 2025
@pytorch-bot

pytorch-bot bot commented Oct 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166570

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit a9a6d97 with merge base fc540ce:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@eqy
Collaborator Author

eqy commented Oct 29, 2025

CC @malfet @atalman

Contributor

@atalman atalman left a comment

lgtm. Thank you

Comment on lines +2848 to +2849
@skipIfRocm
@unittest.skipIf(not PLATFORM_SUPPORTS_CUDNN_ATTENTION, "cudnn Attention is not supported on this system")
Contributor

Why do we want to skip the test on those platforms?

Collaborator Author

This is explicitly testing for a cuDNN bug: `with torch.nn.attention.sdpa_kernel(torch.nn.attention.SDPBackend.CUDNN_ATTENTION):`

Collaborator

Wish we had a strict CUDAOnly for this then.
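
A strict CUDA-only gate could be sketched roughly as below. This is only an illustration under assumptions: `is_nvidia_cuda` and `cuda_only` are hypothetical names, not existing PyTorch test decorators, and the check relies on `torch.version.hip` being `None` on non-ROCm builds.

```python
import unittest

import torch


def is_nvidia_cuda() -> bool:
    """True only when a CUDA device is usable and this is not a ROCm (HIP) build."""
    return torch.cuda.is_available() and torch.version.hip is None


# Hypothetical decorator: skips the test unless running on an NVIDIA CUDA build.
cuda_only = unittest.skipUnless(is_nvidia_cuda(), "requires an NVIDIA CUDA build")


class ExampleTest(unittest.TestCase):
    @cuda_only
    def test_runs_only_on_nvidia_cuda(self):
        # Only reached when the gate above did not skip the test.
        self.assertTrue(torch.cuda.is_available())
```

A single decorator like this would replace the `@skipIfRocm` plus platform-support pair for tests that target an NVIDIA-only library, though it would not by itself check for cuDNN attention support.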

eqy and others added 2 commits October 29, 2025 18:07
k.requires_grad = True
v.requires_grad = True

grad_attn_output = torch.randn(*shape, device='cuda', dtype=torch.bfloat16) * scale
Contributor

nit: use torch.autograd.grad to reuse the exact input tensors
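
A minimal sketch of that suggestion, assuming small CPU tensors so it runs anywhere: the shape, the seed and the helper name `sdpa_grads` are hypothetical and not taken from the test in this PR. `torch.autograd.grad` returns the gradients directly instead of accumulating them into `.grad`, so the same `q`, `k`, `v` can be fed to a second run (for example under a different SDPA backend) without cloning them.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
shape = (1, 2, 8, 16)  # hypothetical (batch, heads, seq_len, head_dim)
q, k, v = (torch.randn(shape, requires_grad=True) for _ in range(3))
grad_out = torch.randn(shape)


def sdpa_grads(q, k, v, grad_out):
    """Run SDPA forward and return (dq, dk, dv) without touching .grad."""
    out = F.scaled_dot_product_attention(q, k, v)
    return torch.autograd.grad(out, (q, k, v), grad_out)


# Two independent runs over the exact same input tensors.
grads_a = sdpa_grads(q, k, v, grad_out)
grads_b = sdpa_grads(q, k, v, grad_out)
```

In the real test, one of the two calls would sit inside the `sdpa_kernel(...)` context manager for the backend under test and the other would serve as the reference.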

attn_output.backward(grad_attn_output)

for x, x_ref in zip((q, k, v), (q_ref, k_ref, v_ref)):
    self.assertEqual(x.grad, x_ref.grad, atol=10.0, rtol=0.05)
Contributor

Can you add a note on the tolerances?

Collaborator Author

"i made them up"
well, scaling things up made the tolerances hard to adjust here and we're just checking for NaN, will add that in comment

Contributor

maybe we should just check for NaN?
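
A NaN-only check could look roughly like this. It is a sketch, not the assertion that landed in the PR; `assert_no_nan` is a hypothetical helper name.

```python
import torch


def assert_no_nan(*tensors):
    """Raise AssertionError if any of the given tensors contains a NaN."""
    for i, t in enumerate(tensors):
        if torch.isnan(t).any():
            raise AssertionError(f"tensor {i} contains NaN")


# Example: a clean gradient passes, a poisoned one is rejected.
clean = torch.ones(4)
poisoned = torch.tensor([1.0, float("nan"), 3.0])
assert_no_nan(clean)
try:
    assert_no_nan(poisoned)
    caught = False
except AssertionError:
    caught = True
```

Inside a test case the equivalent would be along the lines of `self.assertFalse(x.grad.isnan().any())` for each of `q`, `k` and `v`, which avoids having to justify loose `atol`/`rtol` values.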

@eqy
Collaborator Author

eqy commented Nov 3, 2025

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 3, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@Lucaskabela
Contributor

Checking for 2.9.1 tracking - does this close #166211?

@Lucaskabela
Contributor

@eqy

@eqy
Collaborator Author

eqy commented Nov 3, 2025

No, this is just a test. We need a cuDNN frontend submodule bump, as otherwise the 2.9 branch will fail this test @Lucaskabela

@Lucaskabela
Contributor

Got it - @eqy @malfet do we have a PR bumping this we can cherrypick into the 2.9 branch already?

@eqy
Collaborator Author

eqy commented Nov 4, 2025

The upgrade to 1.15.0 frontend resolves this, but that might be too invasive. I'll open a PR proposing a frontend 1.12.2 bump.

@eqy
Collaborator Author

eqy commented Nov 4, 2025

Feel free to cherrypick this PR btw, as it adds a test that we would want in 2.9.1 after #166912 is merged

pytorch-bot bot pushed a commit that referenced this pull request Nov 4, 2025
Repros without the need for specific tensor data.
Should be passing with cuDNN frontend 1.15.0 which current `main` has.

Pull Request resolved: #166570
Approved by: https://github.com/atalman

Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: Aaron Gokaslan <[email protected]>
@atalman
Contributor

atalman commented Nov 5, 2025

@pytorchbot cherry-pick --onto release/2.9 --fixes "cudnn-frontend test" -c regression

pytorchbot pushed a commit that referenced this pull request Nov 5, 2025
Repros without the need for specific tensor data.
Should be passing with cuDNN frontend 1.15.0 which current `main` has.

Pull Request resolved: #166570
Approved by: https://github.com/atalman

Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: Aaron Gokaslan <[email protected]>
(cherry picked from commit 71a2e93)
@pytorchbot
Collaborator

Cherry picking #166570

The cherry pick PR is at #167121 and it is linked with issue cudnn-frontend test. The following tracker issues are updated:


atalman pushed a commit that referenced this pull request Nov 5, 2025
[cuDNN][SDPA] Check-in test for #166211 (#166570)

Repros without the need for specific tensor data.
Should be passing with cuDNN frontend 1.15.0 which current `main` has.

Pull Request resolved: #166570
Approved by: https://github.com/atalman



(cherry picked from commit 71a2e93)

Co-authored-by: Eddie Yan <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: Aaron Gokaslan <[email protected]>
Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: cudnn (Related to torch.backends.cudnn, and cuDNN support), module: sdpa (All things related to torch.nn.functional.scaled_dot_product_attention), open source, topic: not user facing (topic category)

8 participants