Warn on conditions that can trigger cuBLAS sgemm bug #22034
Conversation
I don't see a warning in Hgemm? (It goes through cublasGemmEx, but it can still call a faulty kernel.) Otherwise, LGTM.
aten/src/THC/THCBlas.cu (outdated)

```cpp
cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties();
if (prop->major == 5 || prop->major == 6) {
  if (!alreadyWarned) {
    fprintf(stderr, "Matrix multiplication for dimensions larger than 2^21 has known bugs on your combination of CUDA version and device type. "
```
Please use TORCH_WARN.
Definitely agree with TORCH_WARN.
On a related note, do we have methods of suppressing warnings in C++ via TORCH_WARN? I generally don't think things like warning only once are correct -- Python already has a bunch of functionality for controlling repeated warnings, so reinventing it with no configuration isn't a good idea. I'm not really sure how that applies to C++, though.
Thanks for the pointer to TORCH_WARN. I was trying to find a warning macro, but I was looking for THWarn (which doesn't exist) :)
> python already has a bunch of functionality around controlling repeating warnings
Yeah, I agree this isn't a great way of doing it, but I'm not sure what facilities Python has for that -- can you elaborate?
Anyway, I went with the pattern of warning once based on a flag because I saw that pattern elsewhere in TH.
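For context on the Python side of this exchange: the standard `warnings` module is the machinery being referred to, and it already supports configurable deduplication of repeated warnings. A minimal sketch (the warning text is just a stand-in for what the cuBLAS check would emit, not PyTorch's actual message):

```python
import warnings

WARN_TEXT = (
    "Matrix multiplication for dimensions larger than 2^21 has known bugs "
    "on this CUDA version / device combination."
)

def warn_big_matmul():
    # Hypothetical stand-in for the warning TORCH_WARN would surface in Python.
    warnings.warn(WARN_TEXT, RuntimeWarning)

def count_reported(filter_action, n_calls=3):
    """Trigger the warning n_calls times under the given filter action and
    return how many warnings were actually reported."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(filter_action)
        for _ in range(n_calls):
            warn_big_matmul()
        return len(caught)

print(count_reported("always"))  # every occurrence is reported
print(count_reported("once"))    # duplicates are suppressed process-wide
```

The point being made above is that a hard-coded warn-once flag in C++ bypasses exactly this kind of user-configurable filtering.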
TORCH_WARN is nice because it actually triggers a native Python warning, so at least our main frontend users get a pleasant experience. I'm pretty sure that unless Python is enabled this decays to printf in C++. It's definitely not nice and we should improve this, but I guess it's not a priority 🤷🏻‍♂️
Created issue #22078
facebook-github-bot
left a comment
@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot
left a comment
@zhangguanheng66 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@umanwizard Could you rebase and resolve the merge conflicts? I tried to land the diff and got merge conflict warnings. Thanks.
facebook-github-bot
left a comment
@zhangguanheng66 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: The sgemm in cuBLAS 9.0 has some issues with sizes above 2M on Maxwell and Pascal architectures. Warn in this case.
Pull Request resolved: pytorch/pytorch#22034
Differential Revision: D15949930
Pulled By: zhangguanheng66
fbshipit-source-id: 0af977ec7900c76328d23898071de9c23778ff8b
@zhangguanheng66 merged this pull request in 474dec4.
@umanwizard the merge deleted the only call to `checkCuda90Bug`.
@colesbury Sure. I will submit a PR later. So, if my understanding is correct, I just need to add a checkCuda90Bug check in THCudaBlas_Sgemm and THCudaBlas_Hgemm.
Yes, I think so.
Summary: Add the checkCuda90Bug warning to THCudaBlas_Sgemm and THCudaBlas_Hgemm. pytorch/pytorch#22034
Pull Request resolved: pytorch/pytorch#22757
Differential Revision: D16223085
Pulled By: zhangguanheng66
fbshipit-source-id: 470c6cbaba16a3cec295993c2673f02008a602a6