check error status of CUDA launch after Magma kernels #29003

soumith · 2019-10-31T22:06:05Z

As part of pytorch/hub#62, I found that the stack trace of a failed kernel launch was being recorded elsewhere, away from the launch that actually failed, even with CUDA_LAUNCH_BLOCKING=1.

So I started debugging and found that the magma launches don't do error checking.

The root cause turned out to be that I hadn't compiled sm37 SASS into the magma binary. The failure showed up in x.inverse(), and for some reason this is a problem with magma 2.5.1 but not 2.5.0.
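To illustrate why an unchecked launch error ends up attributed to a different call, here is a minimal sketch in plain C++. It only models a pending "last error" slot that is returned and cleared on read, which is how cudaGetLastError() behaves for launch errors. It does not call CUDA or MAGMA, and every name in it (FakeError, fake_library_call, fake_unrelated_op, and so on) is hypothetical.

```cpp
// Plain C++ model of a pending "last error" slot. Nothing here calls CUDA
// or MAGMA; all names are hypothetical and exist only for illustration.
#include <stdexcept>
#include <string>

enum class FakeError { Success, NoKernelImage };

// Stand-in for the runtime's last-error slot.
static FakeError g_last_error = FakeError::Success;

// Returns the pending error and resets the slot to Success.
FakeError fake_get_last_error() {
  FakeError e = g_last_error;
  g_last_error = FakeError::Success;
  return e;
}

// Stand-in for a library routine whose internal launch fails silently:
// it records the error but never inspects it.
void fake_library_call(bool launch_fails) {
  if (launch_fails) g_last_error = FakeError::NoKernelImage;
}

// Stand-in for a later, correct operation that does check for errors.
// It throws under its own name, which is what makes the trace misleading.
void fake_unrelated_op() {
  if (fake_get_last_error() != FakeError::Success) {
    throw std::runtime_error("error reported in fake_unrelated_op");
  }
}

// No check after the library call: the next checking op takes the blame.
std::string blame_without_check() {
  fake_library_call(true);
  try {
    fake_unrelated_op();
  } catch (const std::runtime_error& e) {
    return e.what();
  }
  return "no error";
}

// Check right after the library call: the failure is reported at the call site.
std::string blame_with_check() {
  fake_library_call(true);
  if (fake_get_last_error() != FakeError::Success) {
    return "error reported right after fake_library_call";
  }
  try {
    fake_unrelated_op();
  } catch (const std::runtime_error& e) {
    return e.what();
  }
  return "no error";
}
```

In this model, blame_without_check() reports the error from fake_unrelated_op, while blame_with_check() reports it right after the library call. The second case corresponds to what an AT_CUDA_CHECK(cudaGetLastError()) placed directly after a magma call is meant to achieve.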

vishwakftw

I think the error checks are sufficient in magma<function_name> definitions, unless you think otherwise.

For the inverse failure, the source of failure could have been the sswap call in magma_sgetri_gpu. Curiously, I grepped to see if there was any error checking in place, and there isn't (weirdly enough).

vishwakftw · 2019-11-01T00:36:49Z

aten/src/ATen/native/cuda/BatchLinearAlgebra.cu

      infos[i] = info_array[i];
    }
  }
+  AT_CUDA_CHECK(cudaGetLastError());


Why would it be required to do it here?

vishwakftw · 2019-11-01T00:36:56Z

aten/src/ATen/native/cuda/BatchLinearAlgebra.cu

  for (int64_t i = 0; i < batch_size; i++) {
    infos[i] = info_array[i];
  }
+  AT_CUDA_CHECK(cudaGetLastError());


vishwakftw · 2019-11-01T00:37:12Z

aten/src/ATen/native/cuda/BatchLinearAlgebra.cu

  magmaGetri<scalar_t>(
    n, self_data, n, ipiv.data_ptr<magma_int_t>(), dwork.data_ptr<scalar_t>(), lwork, &info_tmp);
  info = info_tmp;
+  AT_CUDA_CHECK(cudaGetLastError());


soumith · 2019-11-08T06:30:56Z

Thanks for the review, I've made the changes. Sorry for the delay.

vishwakftw

Thank you for adding these @soumith, no problems about the delay.

Is there a way to test this?

facebook-github-bot

@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Summary: as part of pytorch/hub#62 I found that the stack-trace of a failed kernel launch was being recorded elsewhere, even with CUDA_LAUNCH_BLOCKING=1. So, I started debugging, and found that magma launches don't do error checking. I eventually found the issue to be that I didn't compile-in sm37 SASS into the magma binary and the failure was on `x.inverse()`, and that's somehow a problem for magma 2.5.1 (but not 2.5.0).

Pull Request resolved: pytorch/pytorch#29003

Differential Revision: D18397358

Pulled By: soumith

fbshipit-source-id: 04baca68eac209d7af773daddd0193697d4ab0d9

facebook-github-bot · 2019-11-08T20:10:54Z

@soumith merged this pull request in f441bb1.

soumith requested a review from vishwakftw October 31, 2019 22:06

vishwakftw reviewed Nov 1, 2019

soumith mentioned this pull request Nov 3, 2019

magma functionality isn't working on K80 GPU with official binaries #29096

Closed

soumith added 2 commits November 7, 2019 22:30

check error status of CUDA launch after Magma kernels

ed18287

review changes

c564617

soumith force-pushed the magma_launch branch from ae93027 to c564617 Compare November 8, 2019 06:30

soumith requested a review from vishwakftw November 8, 2019 06:30

vishwakftw approved these changes Nov 8, 2019

facebook-github-bot reviewed Nov 8, 2019

facebook-github-bot closed this in f441bb1 Nov 8, 2019

facebook-github-bot added the merged label Nov 8, 2019

facebook-github-bot deleted the magma_launch branch July 13, 2020 17:57

mruberry added the Merged label Oct 28, 2020
