Warn on conditions that can trigger cuBLAS sgemm bug #22034
Conversation
I don't see a warning in Hgemm? (It goes through cublasGemmEx, but it can still call a faulty kernel.) Otherwise, LGTM.
aten/src/THC/THCBlas.cu (outdated)

```cpp
cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties();
if (prop->major == 5 || prop->major == 6) {
  if (!alreadyWarned) {
    fprintf(stderr, "Matrix multiplication for dimensions larger than 2^21 has known bugs on your combination of CUDA version and device type. "
```
Please use TORCH_WARN.
Definitely agree with TORCH_WARN.
On a related note, do we have methods of suppressing warnings in C++ via TORCH_WARN? I generally don't think things like warning only once are correct -- Python already has a bunch of functionality for controlling repeated warnings, so reinventing it with no configuration isn't a good idea. I'm not really sure how that applies to C++, though.
Thanks for the pointer to TORCH_WARN. I was trying to find a warning macro, but I was looking for THWarn (which doesn't exist) :)
> python already has a bunch of functionality around controlling repeating warnings
Yeah, I agree this isn't a great way of doing it, but I'm not sure what facilities Python has for that -- can you elaborate?
Anyway, I went with the pattern of warning once based on a flag because I saw that pattern elsewhere in TH.
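For context on the Python side of this exchange: the standard `warnings` module is the machinery being referred to, and it already supports configurable deduplication of repeated warnings. A minimal sketch (the warning text is just a stand-in for what the cuBLAS check would emit, not PyTorch's actual message):

```python
import warnings

WARN_TEXT = (
    "Matrix multiplication for dimensions larger than 2^21 has known bugs "
    "on this CUDA version / device combination."
)

def warn_big_matmul():
    # Hypothetical stand-in for the warning TORCH_WARN would surface in Python.
    warnings.warn(WARN_TEXT, RuntimeWarning)

def count_reported(filter_action, n_calls=3):
    """Trigger the warning n_calls times under the given filter action and
    return how many warnings were actually reported."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(filter_action)
        for _ in range(n_calls):
            warn_big_matmul()
        return len(caught)

print(count_reported("always"))  # every occurrence is reported
print(count_reported("once"))    # duplicates are suppressed process-wide
```

The point being made above is that a hard-coded warn-once flag in C++ bypasses exactly this kind of user-configurable filtering.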
TORCH_WARN is nice because it actually triggers a native Python warning, so at least our main frontend users get a pleasant experience. I'm pretty sure that unless Python is enabled this decays to printf in C++. It's definitely not nice and we should improve this, but I guess it's not a priority 🤷🏻‍♂️
Created issue #22078
facebook-github-bot
left a comment
@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot
left a comment
@zhangguanheng66 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@umanwizard Could you rebase and resolve the merge conflicts? I tried to land the diff and got merge conflict warnings. Thanks.
facebook-github-bot
left a comment
@zhangguanheng66 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: The sgemm in cuBLAS 9.0 has some issues with sizes above 2M on Maxwell and Pascal architectures. Warn in this case.
Pull Request resolved: pytorch/pytorch#22034
Differential Revision: D15949930
Pulled By: zhangguanheng66
fbshipit-source-id: 0af977ec7900c76328d23898071de9c23778ff8b
@zhangguanheng66 merged this pull request in 474dec4.
@umanwizard the merge deleted the only call to `checkCuda90Bug`.
@colesbury Sure. I will submit a PR later. So, if my understanding is correct, I just need to add a checkCuda90Bug check in THCudaBlas_Sgemm and THCudaBlas_Hgemm.
Yes, I think so.
Summary: Add the checkCuda90Bug warning to THCudaBlas_Sgemm and THCudaBlas_Hgemm. pytorch/pytorch#22034
Pull Request resolved: pytorch/pytorch#22757
Differential Revision: D16223085
Pulled By: zhangguanheng66
fbshipit-source-id: 470c6cbaba16a3cec295993c2673f02008a602a6