-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Add GPU implementation of pdist #11102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
86d4cd1 to
6d259ad
Compare
|
@pytorchbot retest this please |
|
@pytorchbot retest this please |
1 similar comment
|
@pytorchbot retest this please |
|
@pytorchbot retest this please |
Test Plan:
python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist
Hipify had a bug where when the second CUDA kernel arguemnt was at the end of the kernel, it would set the end of that argument one character too soon. This fixes that bug by adding one to the end if the current character is not a comma, e.g. it's at the end of the kernel string.
colesbury
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, just a few comments about code style.
In general, we only label pointer and reference types as const
|
|
||
| namespace { | ||
|
|
||
| static const int warp_size = 32; |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
| } | ||
|
|
||
| // Reduce warps | ||
| for (int offset = warp_size / 2; offset > 0; offset /= 2) { |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
| const int vert_per_block = 4; | ||
| const int horiz_blocks = (m + horiz_per_block * 8 - 1) / (horiz_per_block * 8); | ||
| const int vert_blocks = (dist.numel() + vert_per_block - 1) / vert_per_block; | ||
| const dim3 blocks(horiz_blocks, vert_blocks); |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
@colesbury When you're talking about |
|
@pytorchbot retest this please |
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
erikbrinkman is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
Add the gpu kernel version.
The parallelism I went with performs poorly when there are a large number of vectors, but they're all short, as I don't allocate the thread pool to wrap in that case.
Test Plan
---------
```
python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist
```
Current performance specs are a little underwhelming, I'm in the process of debugging.
size | torch | torch cuda | scipy
-----|-------|------------|------
16 x 16 | 9.13 µs ± 3.55 µs | 9.86 µs ± 81.5 ns | 15.8 µs ± 1.2 µs
16 x 1024 | 15 µs ± 224 ns | 9.48 µs ± 88.7 ns | 88.7 µs ± 8.83 µs
1024 x 16 | 852 µs ± 6.03 µs | 7.84 ms ± 6.22 µs | 4.7 ms ± 166 µs
1024 x 1024 | 34.1 ms ± 803 µs | 11.5 ms ± 6.24 µs | 273 ms ± 6.7 ms
2048 x 2048 | 261 ms ± 3.5 ms | 77.5 ms ± 41.5 µs | 2.5 s ± 97.6 ms
4096 x 4096 | 2.37 s ± 154 ms | 636 ms ± 2.97 µs | 25.9 s ± 394 ms
Pull Request resolved: pytorch/pytorch#11102
Differential Revision: D9697305
Pulled By: erikbrinkman
fbshipit-source-id: 2b4f4b816c02b3715a85d8db3f4e77479d19bb99
Summary:
Add the gpu kernel version.
The parallelism I went with performs poorly when there are a large number of vectors, but they're all short, as I don't allocate the thread pool to wrap in that case.
Test Plan
---------
```
python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist
```
Current performance specs are a little underwhelming, I'm in the process of debugging.
size | torch | torch cuda | scipy
-----|-------|------------|------
16 x 16 | 9.13 µs ± 3.55 µs | 9.86 µs ± 81.5 ns | 15.8 µs ± 1.2 µs
16 x 1024 | 15 µs ± 224 ns | 9.48 µs ± 88.7 ns | 88.7 µs ± 8.83 µs
1024 x 16 | 852 µs ± 6.03 µs | 7.84 ms ± 6.22 µs | 4.7 ms ± 166 µs
1024 x 1024 | 34.1 ms ± 803 µs | 11.5 ms ± 6.24 µs | 273 ms ± 6.7 ms
2048 x 2048 | 261 ms ± 3.5 ms | 77.5 ms ± 41.5 µs | 2.5 s ± 97.6 ms
4096 x 4096 | 2.37 s ± 154 ms | 636 ms ± 2.97 µs | 25.9 s ± 394 ms
Pull Request resolved: pytorch#11102
Differential Revision: D9697305
Pulled By: erikbrinkman
fbshipit-source-id: 2b4f4b816c02b3715a85d8db3f4e77479d19bb99
Add the gpu kernel version.
The parallelism I went with performs poorly when there are a large number of vectors, but they're all short, as I don't allocate the thread pool to wrap in that case.
Test Plan
Current performance specs are a little underwhelming, I'm in the process of debugging.