
Conversation

@umanwizard
Contributor

The sgemm in cuBLAS 9.0 has some issues with sizes above 2M on Maxwell and Pascal architectures. Warn in this case.

@umanwizard requested review from gchanan, ngimel and soumith on June 20, 2019 at 19:17
@pytorchbot added labels: module: cublas (Problem related to cublas support), module: cuda (Related to torch.cuda, and CUDA support in general), module: operators (Jun 20, 2019)
@ngimel
Collaborator

ngimel commented Jun 20, 2019

I don't see a warning in Hgemm? (It goes through cublasGemmEx, but it can still call a faulty kernel.) Otherwise, LGTM.

cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties();
if (prop->major == 5 || prop->major == 6) {
  if (!alreadyWarned) {
    fprintf(stderr, "Matrix multiplication for dimensions larger than 2^21 has known bugs on your combination of CUDA version and device type. "
Contributor

Please use TORCH_WARN.
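
For context, a minimal sketch of what the suggested change could look like, keeping the alreadyWarned flag from the hunk above; this is an illustration, not necessarily the code that was ultimately merged:

cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties();
if (prop->major == 5 || prop->major == 6) {  // Maxwell or Pascal
  if (!alreadyWarned) {
    // TORCH_WARN surfaces as a native Python warning when called from the
    // Python frontend, instead of writing straight to stderr.
    TORCH_WARN(
        "Matrix multiplication for dimensions larger than 2^21 has known bugs "
        "on your combination of CUDA version and device type.");
    alreadyWarned = true;
  }
}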

Contributor

Definitely agree with TORCH_WARN.

On a related note, do we have ways of suppressing warnings emitted via TORCH_WARN in C++? I generally don't think doing things like only warning once is correct -- Python already has a bunch of functionality around controlling repeated warnings, so reinventing it with no configuration isn't a good idea. I'm not really sure how that applies to C++, though.

Contributor Author

Thanks for the pointer to TORCH_WARN. I was trying to find a warning macro, but I was looking for THWarn (which doesn't exist) :)

Contributor Author

@gchanan

python already has a bunch of functionality around controlling repeating warnings

Yeah, I agree this isn't a great way of doing it, but I'm not sure which facilities Python has for this -- can you elaborate?

Anyway, I went with the pattern of warning once based on a flag because I saw that pattern elsewhere in TH.

Contributor

TORCH_WARN is nice because it actually triggers a native Python warning, so at least our main frontend users get a pleasant experience. I'm pretty sure that unless Python is enabled this decays to a printf in C++. It's definitely not nice and we should improve this, but I guess it's not a priority 🤷🏻‍♂️

@umanwizard
Contributor Author

Created issue #22078

@facebook-github-bot left a comment

@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot left a comment

@zhangguanheng66 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@zhangguanheng66
Contributor

@umanwizard Could you rebase and resolve the merge conflicts? I tried to land the diff and got merge conflict warnings. Thanks.

@facebook-github-bot left a comment

@zhangguanheng66 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 2, 2019
Summary:
The sgemm in cuBLAS 9.0 has some issues with sizes above 2M on Maxwell and Pascal architectures. Warn in this case.
Pull Request resolved: pytorch/pytorch#22034

Differential Revision: D15949930

Pulled By: zhangguanheng66

fbshipit-source-id: 0af977ec7900c76328d23898071de9c23778ff8b
@facebook-github-bot
Contributor

@zhangguanheng66 merged this pull request in 474dec4.

xzhu1900 pushed a commit to xzhu1900/pytorch that referenced this pull request Jul 5, 2019
Summary:
The sgemm in cuBLAS 9.0 has some issues with sizes above 2M on Maxwell and Pascal architectures. Warn in this case.
Pull Request resolved: pytorch#22034

Differential Revision: D15949930

Pulled By: zhangguanheng66

fbshipit-source-id: 0af977ec7900c76328d23898071de9c23778ff8b
@colesbury
Member

@zhangguanheng66 the merge deleted the only call to checkCuda90Bug so the PR has no effect. Can you fix this?

@zhangguanheng66
Contributor

@colesbury Sure. I will submit a PR later. So I just need to add a checkCuda90Bug call in THCudaBlas_Sgemm and THCudaBlas_Hgemm, if my understanding is correct.

@colesbury
Member

Yes, I think so
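
In rough terms, the follow-up amounts to re-inserting the check at the top of both GEMM wrappers. A hedged sketch, with parameter lists abbreviated and not taken verbatim from the follow-up PR (#22757):

void THCudaBlas_Sgemm(THCState *state, char transa, char transb,
                      int64_t m, int64_t n, int64_t k /* , ...remaining GEMM arguments */) {
  checkCuda90Bug((int)m, (int)n, (int)k);  // warn once on Maxwell/Pascal with the affected cuBLAS
  // ... existing cublasSgemm dispatch ...
}

void THCudaBlas_Hgemm(THCState *state, char transa, char transb,
                      int64_t m, int64_t n, int64_t k /* , ...remaining GEMM arguments */) {
  checkCuda90Bug((int)m, (int)n, (int)k);  // Hgemm goes through cublasGemmEx but can hit the same kernel
  // ... existing cublasGemmEx dispatch ...
}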

facebook-github-bot pushed a commit that referenced this pull request Jul 12, 2019
Summary:
Initiate checkCuda90Bug warning to THCudaBlas_Sgemm and THCudaBlas_Hgemm.
#22034
Pull Request resolved: #22757

Differential Revision: D16223085

Pulled By: zhangguanheng66

fbshipit-source-id: 470c6cbaba16a3cec295993c2673f02008a602a6
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 12, 2019
Summary:
Initiate checkCuda90Bug warning to THCudaBlas_Sgemm and THCudaBlas_Hgemm.
pytorch/pytorch#22034
Pull Request resolved: pytorch/pytorch#22757

Differential Revision: D16223085

Pulled By: zhangguanheng66

fbshipit-source-id: 470c6cbaba16a3cec295993c2673f02008a602a6
