
Conversation

@erikbrinkman
Contributor

Add the GPU kernel version.

The parallelism I went with performs poorly when there are a large number of vectors that are all short, since I don't allocate the thread pool to wrap around in that case.
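Roughly, the scheme is one block per pair of rows, with the block's threads striding over the features and then reducing. The sketch below is only an illustration of that idea; the kernel name, block size, launch configuration, and indexing are assumptions for the example, not the exact code in this diff.

```
// Illustrative sketch only (not the kernel in this PR): one block per (i, j)
// pair of rows; threads stride over the m features and then tree-reduce.
// Assumes a power-of-two block size <= 256 and i < j producing the output.
template <typename scalar_t>
__global__ void pdist_sketch(scalar_t* dist, const scalar_t* self,
                             int64_t n, int64_t m, double p) {
  const int64_t i = blockIdx.x;
  const int64_t j = blockIdx.y;
  if (j <= i) return;  // only the upper triangle holds a distance

  const scalar_t* a = self + i * m;
  const scalar_t* b = self + j * m;

  // Each thread accumulates a partial p-norm sum over a strided slice of the
  // two rows. With many rows but few features, most threads here sit idle.
  scalar_t agg = 0;
  for (int64_t k = threadIdx.x; k < m; k += blockDim.x) {
    scalar_t diff = a[k] - b[k];
    if (diff < 0) diff = -diff;
    agg += pow(diff, scalar_t(p));
  }

  // Block-wide tree reduction in shared memory.
  __shared__ scalar_t buf[256];
  buf[threadIdx.x] = agg;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s /= 2) {
    if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
    __syncthreads();
  }

  if (threadIdx.x == 0) {
    // Condensed (upper-triangular) output index for the pair (i, j).
    dist[i * n - i * (i + 1) / 2 + j - i - 1] = pow(buf[0], scalar_t(1.0 / p));
  }
}

// Hypothetical launch: one block per pair, 256 threads over the features.
// pdist_sketch<<<dim3(n, n), 256>>>(dist_ptr, self_ptr, n, m, p);
```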

Test Plan

```
python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist
```

Current performance numbers are a little underwhelming; I'm in the process of debugging.

size | torch | torch cuda | scipy
-----|-------|------------|------
16 x 16 | 9.13 µs ± 3.55 µs | 9.86 µs ± 81.5 ns | 15.8 µs ± 1.2 µs
16 x 1024 | 15 µs ± 224 ns | 9.48 µs ± 88.7 ns | 88.7 µs ± 8.83 µs
1024 x 16 | 852 µs ± 6.03 µs | 7.84 ms ± 6.22 µs | 4.7 ms ± 166 µs
1024 x 1024 | 34.1 ms ± 803 µs | 11.5 ms ± 6.24 µs | 273 ms ± 6.7 ms
2048 x 2048 | 261 ms ± 3.5 ms | 77.5 ms ± 41.5 µs | 2.5 s ± 97.6 ms
4096 x 4096 | 2.37 s ± 154 ms | 636 ms ± 2.97 µs | 25.9 s ± 394 ms

@ezyang
Contributor

ezyang commented Sep 4, 2018

@pytorchbot retest this please

@ezyang
Contributor

ezyang commented Sep 4, 2018

@pytorchbot retest this please

1 similar comment
@ezyang
Contributor

ezyang commented Sep 4, 2018

@pytorchbot retest this please

@erikbrinkman
Contributor Author

@pytorchbot retest this please

Test Plan:
python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist
Hipify had a bug: when the second CUDA kernel argument was at the end of the kernel string, it would set the end of that argument one character too soon. This fixes the bug by adding one to the end position if the current character is not a comma, i.e. the argument is at the end of the kernel string.
Member

@colesbury left a comment

looks good, just a few comments about code style.

In general, we only label pointer and reference types as const
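For reference, the convention looks roughly like the hypothetical snippet below (illustrative names, not code from this diff): const stays on pointer and reference types, where it is part of the contract with the caller, and is dropped on plain local values and by-value parameters.

```
#include <cstdint>

// Illustrative only. Keep const where it constrains what the function may
// touch (pointers/references); drop it on plain local scalars.
template <typename scalar_t>
void example(const scalar_t* self, scalar_t* dist, int64_t m) {
  scalar_t agg = 0;            // local scalar: no const, per the style note
  for (int64_t k = 0; k < m; ++k) {
    agg += self[k];            // self is const: the input is never written
  }
  dist[0] = agg;
}
```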


namespace {

static const int warp_size = 32;

}

// Reduce warps
for (int offset = warp_size / 2; offset > 0; offset /= 2) {
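The body of a loop like this is the usual warp-shuffle tree reduction. A minimal sketch, assuming the CUDA 9+ __shfl_down_sync intrinsic (the PR itself may go through a wrapper), would be:

```
// Hypothetical helper, not the PR's code: warp-level tree reduction.
// After the loop, lane 0 holds the sum of all 32 lanes of the warp.
__device__ inline float warp_reduce_sum(float agg) {
  const int warp_size = 32;
  for (int offset = warp_size / 2; offset > 0; offset /= 2) {
    agg += __shfl_down_sync(0xffffffff, agg, offset);
  }
  return agg;
}
```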

const int vert_per_block = 4;
const int horiz_blocks = (m + horiz_per_block * 8 - 1) / (horiz_per_block * 8);
const int vert_blocks = (dist.numel() + vert_per_block - 1) / vert_per_block;
const dim3 blocks(horiz_blocks, vert_blocks);

@erikbrinkman
Contributor Author

@colesbury When you're talking about const, do you mean the const scalar_t? I feel like that one is important for making it clear that I'm not modifying the value. Do you want me to remove them, or is there a different const I should change?

@erikbrinkman
Contributor Author

@pytorchbot retest this please

Contributor

@facebook-github-bot left a comment

erikbrinkman is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Sep 7, 2018
Pull Request resolved: pytorch/pytorch#11102

Differential Revision: D9697305

Pulled By: erikbrinkman

fbshipit-source-id: 2b4f4b816c02b3715a85d8db3f4e77479d19bb99
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
@ezyang added the merged label Jun 26, 2019