
Conversation

@colesbury (Member) commented May 24, 2019

This fixes advanced indexing in cases where there are more than 2^31-1
bytes in the output. The `gpu_index_kernel` was missing the
`can_use_32bit_indexing`/`with_32bit_indexing` check.
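
As a hedged illustration (the wrapper name and include path below are mine, not taken from the diff), this is the usual TensorIterator splitting idiom being referred to:

```cpp
#include <ATen/native/TensorIterator.h>

// Sketch only, not the actual PR change: if the iterator spans more bytes
// than a signed 32-bit offset can address, split it into 32-bit-safe
// sub-iterators and launch the kernel on each piece.
void launch_with_32bit_indexing(at::TensorIterator& iter) {
  if (!iter.can_use_32bit_indexing()) {
    for (auto& sub_iter : iter.with_32bit_indexing()) {
      launch_with_32bit_indexing(sub_iter);  // recurse on pieces that fit
    }
    return;
  }
  // ... the existing CUDA launch path goes here; offsets now fit in a
  // signed 32-bit integer, so OffsetCalculator and friends are safe ...
}
```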

This also adds a number of TORCH_INTERNAL_ASSERTs in Loops.cuh,
OffsetCalculator, and IntDivider checking that sizes fit in a signed 32-bit
integer.
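
For flavor, a minimal sketch of the kind of guard described, assuming the c10 TORCH_INTERNAL_ASSERT macro (the helper name and exact call sites are illustrative):

```cpp
#include <c10/util/Exception.h>
#include <cstdint>
#include <limits>

// Sketch only: fail loudly if a size is about to be narrowed to a signed
// 32-bit index that it does not fit in.
inline void assert_fits_in_int32(int64_t size) {
  TORCH_INTERNAL_ASSERT(
      size >= 0 && size <= std::numeric_limits<int32_t>::max(),
      "size does not fit in a signed 32-bit integer: ", size);
}
```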

More comprehensive tests that require a 32 GB GPU are here:
https://gist.github.com/colesbury/e29387f5851521256dff562be07b981e

Fixes #20888

@pytorchbot added the `module: cuda` and `module: operators` labels May 24, 2019
@colesbury requested a review from gchanan May 24, 2019 20:06
@facebook-github-bot (Contributor) left a comment:

@colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request May 25, 2019
Summary:
This fixes advanced indexing in cases where there are more than 2^31-1
bytes in the output. The `gpu_index_kernel` was missing the
`can_use_32bit_indexing`/`with_32bit_indexing` check.

This also adds a number of TORCH_INTERNAL_ASSERTs in Loops.cuh,
OffsetCalculator, and IntDivider checking that sizes fit in a signed 32-bit
integer.

More comprehensive tests that require a 32 GB GPU are here:
https://gist.github.com/colesbury/e29387f5851521256dff562be07b981e

Fixes #20888
Pull Request resolved: pytorch/pytorch#20919

Differential Revision: D15501945

Pulled By: colesbury

fbshipit-source-id: e876e678e866d2efda8ee92c47a1d2d1310671f0
@facebook-github-bot (Contributor) commented:

@colesbury merged this pull request in b93bdf6.

@ezyang (Contributor) commented May 25, 2019

Looks like this durably broke ROCm https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/22251//console (or you were unlucky and an upgrade happened at the same time)

04:17:28 test_multinomial (__main__.TestCuda) ... ### HCC STATUS_CHECK Error: HSA_STATUS_ERROR_EXCEPTION (0x1016) at file:mcwamp_hsa.cpp line:1193
04:17:42 Traceback (most recent call last):
04:17:42   File "test/run_test.py", line 441, in <module>
04:17:42     main()
04:17:42   File "test/run_test.py", line 433, in main
04:17:42     raise RuntimeError(message)
04:17:42 RuntimeError: test_cuda failed! Received signal: SIGIOT

@ezyang (Contributor) commented May 25, 2019

cc @bddppq @iotamudelta

@colesbury (Member, Author) commented:

I think the test failure is unrelated to this change

colesbury added a commit to colesbury/pytorch that referenced this pull request May 25, 2019
@colesbury (Member, Author) commented:

Well... maybe not. I'll see if #20948 fixes the ROCm build

@bddppq (Contributor) commented May 28, 2019

There was no ROCm upgrade recently, although I don't see anything obviously suspicious in this diff either :(.
Since this has broken master, let's revert it. I will work with @colesbury to locally repro and further debug it tomorrow.

colesbury added a commit to colesbury/pytorch that referenced this pull request May 28, 2019
This is pytorch#20919 without the changes to aten/src/THC/THCIntegerDivider.cuh
that broke the ROCm build.

Original summary:

This fixes advanced indexing in cases where there are more than 2^31-1
bytes in the output. The `gpu_index_kernel` was missing the
`can_use_32bit_indexing`/`with_32bit_indexing` check.

This also adds a number of TORCH_INTERNAL_ASSERTs in Loops.cuh,
OffsetCalculator, and IntDivider checking that sizes fit in a signed 32-bit
integer.

More comprehensive tests that require a 32 GB GPU are here:
https://gist.github.com/colesbury/e29387f5851521256dff562be07b981e
colesbury added a commit to colesbury/pytorch that referenced this pull request May 28, 2019
This is the change to aten/src/THC/THCIntegerDivider.cuh from pytorch#20919
in order to debug why it breaks the ROCm multinomial test.
facebook-github-bot pushed a commit that referenced this pull request May 28, 2019
Summary:
This is #20919 without the changes to aten/src/THC/THCIntegerDivider.cuh
that broke the ROCm build.

cc bddppq

Original summary:

This fixes advanced indexing in cases where there are more than 2^31-1
bytes in the output. The `gpu_index_kernel` was missing the
`can_use_32bit_indexing`/`with_32bit_indexing` check.

This also adds a number of TORCH_INTERNAL_ASSERTs in Loops.cuh,
OffsetCalculator, and IntDivider checking that sizes fit in a signed 32-bit
integer.

More comprehensive tests that require a 32 GB GPU are here:
https://gist.github.com/colesbury/e29387f5851521256dff562be07b981e
Pull Request resolved: #21019

Differential Revision: D15518477

Pulled By: colesbury

fbshipit-source-id: 4db5626fda76eb58250793e8aa7d4f2832db3a34
Successfully merging this pull request may close: #20888 (Reindexing a huge tensor to shuffle it results in data loss at the end of the tensor).
