[cuDNN][SDPA] Check-in test for #166211 #166570
Conversation
🔗 Helpful Links: see artifacts and rendered test results at hud.pytorch.org/pr/166570
Note: links to docs will display an error until the docs builds have completed.
❗ 1 active SEV: if your PR is affected, please view it.
✅ No failures as of commit a9a6d97 with merge base fc540ce.
(This comment was automatically generated by Dr. CI and updates every 15 minutes.)
atalman left a comment:
lgtm. Thank you
    @skipIfRocm
    @unittest.skipIf(not PLATFORM_SUPPORTS_CUDNN_ATTENTION, "cudnn Attention is not supported on this system")
Why do we want to skip the test on those platforms?
This is explicitly testing for a cuDNN bug:

    with torch.nn.attention.sdpa_kernel(torch.nn.attention.SDPBackend.CUDNN_ATTENTION):
Wish we had a strict CUDAOnly for this then.
Co-authored-by: Nikita Shulga <[email protected]>
    k.requires_grad = True
    v.requires_grad = True

    grad_attn_output = torch.randn(*shape, device='cuda', dtype=torch.bfloat16) * scale
nit: use torch.autograd.grad to reuse the exact input tensors
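The suggestion above could look like the following sketch. The shapes are invented for illustration; the real test constructs its own q/k/v.

```python
import torch
import torch.nn.functional as F

# Hypothetical small shapes; the PR's test builds its own tensors.
q = torch.randn(2, 4, 8, requires_grad=True)
k = torch.randn(2, 4, 8, requires_grad=True)
v = torch.randn(2, 4, 8, requires_grad=True)

attn_output = F.scaled_dot_product_attention(q, k, v)
grad_attn_output = torch.randn_like(attn_output)

# torch.autograd.grad returns (dq, dk, dv) for exactly these input
# tensors, instead of mutating q.grad/k.grad/v.grad via backward(),
# so the same inputs can be reused for the reference computation.
dq, dk, dv = torch.autograd.grad(attn_output, (q, k, v), grad_attn_output)
print(dq.shape, dk.shape, dv.shape)
```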
test/test_transformers.py (outdated)
    attn_output.backward(grad_attn_output)

    for x, x_ref in zip((q, k, v), (q_ref, k_ref, v_ref)):
        self.assertEqual(x.grad, x_ref.grad, atol=10.0, rtol=0.05)
Can you add a note on the tolerances?
"I made them up."
Well, scaling things up made the tolerances hard to adjust here, and we're really just checking for NaN; will add that in a comment.
maybe we should just check for NaN?
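A NaN-only check, as suggested, could be sketched like this. The helper name `assert_no_nan` is hypothetical, not from the PR; the point is that the original bug produced NaNs in the gradients, so asserting their absence is the meaningful part of the comparison.

```python
import torch

def assert_no_nan(*tensors):
    # Fail if any element of any tensor is NaN; this is the sketch of
    # the "just check for NaN" idea instead of loose atol/rtol bounds.
    for t in tensors:
        assert not torch.isnan(t).any().item(), "cuDNN SDPA produced NaNs"

assert_no_nan(torch.randn(4, 4))  # finite values pass
try:
    assert_no_nan(torch.tensor([1.0, float("nan")]))
except AssertionError:
    print("caught NaN")
```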
Co-authored-by: Aaron Gokaslan <[email protected]>
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Checking for 2.9.1 tracking: does this close #166211?
No, this is just a test; we need a cuDNN frontend submodule bump, as otherwise the 2.9 branch will fail this test. @Lucaskabela
The upgrade to the 1.15.0 frontend resolves this, but that might be too invasive. I'll open a PR proposing a frontend 1.12.2 bump.
Feel free to cherry-pick this PR btw, as it adds a test that we would want in 2.9.1 after #166912 is merged.
Repros without the need for specific tensor data. Should be passing with cuDNN frontend 1.15.0, which current `main` has. Pull Request resolved: #166570. Approved by: https://github.com/atalman. Co-authored-by: Nikita Shulga <[email protected]>. Co-authored-by: Aaron Gokaslan <[email protected]>.
@pytorchbot cherry-pick --onto release/2.9 --fixes "cudnn-frontend test" -c regression
Cherry picking #166570: the cherry-pick PR is at #167121 and is linked with issue "cudnn-frontend test". The tracker issues have been updated. (Details for the Dev Infra team; raised by workflow job.)
[cuDNN][SDPA] Check-in test for #166211 (#166570). Repros without the need for specific tensor data. Should be passing with cuDNN frontend 1.15.0, which current `main` has. Pull Request resolved: #166570. Approved by: https://github.com/atalman. (cherry picked from commit 71a2e93) Co-authored-by: Eddie Yan <[email protected]>. Co-authored-by: Nikita Shulga <[email protected]>. Co-authored-by: Aaron Gokaslan <[email protected]>.
cc @csarofeen @ptrblck @xwang233