@soumith soumith commented Oct 31, 2019

As part of pytorch/hub#62, I found that the stack trace of a failed kernel launch was being recorded somewhere other than the failing call, even with CUDA_LAUNCH_BLOCKING=1.

So I started debugging, and found that MAGMA launches don't do error checking.

I eventually traced the issue to the fact that I hadn't compiled sm37 SASS into the MAGMA binary. The failure was on x.inverse(), and the missing SASS is somehow a problem for MAGMA 2.5.1 (but not 2.5.0).
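
To illustrate why the failure showed up elsewhere: a launch that fails can return to the caller normally and only record an error, which is then reported by whichever later call happens to check for it. The sketch below simulates that behaviour in plain C++, with no CUDA or MAGMA involved. Every name in it (`fake_library_inverse`, `check_last_error`, `fake_unrelated_op`, and so on) is a hypothetical stand-in, not a real API, and the "get and reset" semantics are an assumption modelled loosely on how a last-error query behaves.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a runtime that records launch errors instead of
// reporting them at the call site. None of these names are CUDA or MAGMA APIs.
enum class FakeError { Success, NoKernelImage };

static FakeError g_last_error = FakeError::Success;

// Returns the recorded error and resets it to Success.
FakeError fake_get_last_error() {
  FakeError e = g_last_error;
  g_last_error = FakeError::Success;
  return e;
}

// Throws, naming the caller-supplied site, if an error is pending.
void check_last_error(const std::string& site) {
  if (fake_get_last_error() != FakeError::Success) {
    throw std::runtime_error("launch error detected at " + site);
  }
}

// A "library call" whose launch fails but which still returns normally.
void fake_library_inverse(bool kernel_image_available) {
  if (!kernel_image_available) {
    g_last_error = FakeError::NoKernelImage;
  }
}

// An unrelated later operation that happens to check for errors.
void fake_unrelated_op() { check_last_error("unrelated_op"); }

// Runs both calls and reports where the failure surfaced.
std::string where_error_surfaces(bool check_after_library_call) {
  g_last_error = FakeError::Success;
  try {
    fake_library_inverse(/*kernel_image_available=*/false);
    if (check_after_library_call) {
      check_last_error("inverse");
    }
    fake_unrelated_op();
  } catch (const std::runtime_error& e) {
    return e.what();
  }
  return "no error";
}
```

With the check placed directly after the library call, `where_error_surfaces(true)` attributes the failure to the `inverse` site. Without it, `where_error_surfaces(false)` attributes the same failure to `unrelated_op`, which is the kind of misattribution described above.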

@soumith soumith requested a review from vishwakftw October 31, 2019 22:06

@vishwakftw vishwakftw left a comment

I think error checks inside the magma<function_name> definitions are sufficient, unless you think otherwise.

For the inverse failure, the source could have been the sswap call in magma_sgetri_gpu. Out of curiosity, I grepped to see whether any error checking was in place there, and there isn't (weirdly enough).

infos[i] = info_array[i];
}
}
AT_CUDA_CHECK(cudaGetLastError());

Why would it be required to do it here?

for (int64_t i = 0; i < batch_size; i++) {
  infos[i] = info_array[i];
}
AT_CUDA_CHECK(cudaGetLastError());

Likewise

magmaGetri<scalar_t>(
    n, self_data, n, ipiv.data_ptr<magma_int_t>(), dwork.data_ptr<scalar_t>(), lwork, &info_tmp);
info = info_tmp;
AT_CUDA_CHECK(cudaGetLastError());

Likewise

soumith commented Nov 8, 2019

Thanks for the review; I've made the changes. Sorry for the delay.


@vishwakftw vishwakftw left a comment

Thank you for adding these, @soumith; no problem about the delay.

Is there a way to test this?


@facebook-github-bot facebook-github-bot left a comment

@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 8, 2019
Summary:
As part of pytorch/hub#62, I found that the stack trace of a failed kernel launch was being recorded somewhere other than the failing call, even with CUDA_LAUNCH_BLOCKING=1.

So I started debugging, and found that MAGMA launches don't do error checking.

I eventually traced the issue to the fact that I hadn't compiled sm37 SASS into the MAGMA binary. The failure was on `x.inverse()`, and the missing SASS is somehow a problem for MAGMA 2.5.1 (but not 2.5.0).
Pull Request resolved: pytorch/pytorch#29003

Differential Revision: D18397358

Pulled By: soumith

fbshipit-source-id: 04baca68eac209d7af773daddd0193697d4ab0d9

@soumith merged this pull request in f441bb1.

@facebook-github-bot facebook-github-bot deleted the magma_launch branch July 13, 2020 17:57