Add GPU implementation of pdist #11102

erikbrinkman · 2018-08-30T19:11:31Z

Add the GPU kernel version.

The parallelization scheme I went with performs poorly when there is a large number of short vectors, because I don't allocate the thread pool to wrap in that case.
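For readers unfamiliar with `pdist`, here is a minimal pure-Python sketch of the quantity being computed: the condensed vector of pairwise p-norm distances between the rows of a matrix. The function name `pdist_reference` is hypothetical, and this says nothing about how the CUDA kernel is organized; it only pins down the expected output layout, which matches what `scipy.spatial.distance.pdist` returns.

```python
import math
from itertools import combinations


def pdist_reference(rows, p=2.0):
    """Condensed pairwise p-norm distances between the rows of `rows`.

    Hypothetical reference helper, not the PR's implementation.
    For n rows the result has n * (n - 1) / 2 entries, ordered
    (0,1), (0,2), ..., (0,n-1), (1,2), ..., (n-2,n-1).
    Supports p > 0 and p = inf only.
    """
    if not (p > 0):
        raise ValueError("this sketch only handles p > 0")
    out = []
    for a, b in combinations(rows, 2):
        if len(a) != len(b):
            raise ValueError("all rows must have the same length")
        diffs = [abs(x - y) for x, y in zip(a, b)]
        if math.isinf(p):
            # Infinity norm: largest absolute coordinate difference.
            out.append(float(max(diffs, default=0.0)))
        else:
            out.append(float(sum(d ** p for d in diffs) ** (1.0 / p)))
    return out
```

With n rows of length m, the output has n(n-1)/2 entries, so a 1024 x 16 input produces 523,776 distances that each need only 16 subtractions. That is the "many short vectors" shape mentioned above.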

Test Plan

python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist

Current performance numbers are a little underwhelming; I'm in the process of debugging.

size | torch | torch cuda | scipy
-----|-------|------------|------
16 x 16 | 9.13 µs ± 3.55 µs | 9.86 µs ± 81.5 ns | 15.8 µs ± 1.2 µs
16 x 1024 | 15 µs ± 224 ns | 9.48 µs ± 88.7 ns | 88.7 µs ± 8.83 µs
1024 x 16 | 852 µs ± 6.03 µs | 7.84 ms ± 6.22 µs | 4.7 ms ± 166 µs
1024 x 1024 | 34.1 ms ± 803 µs | 11.5 ms ± 6.24 µs | 273 ms ± 6.7 ms
2048 x 2048 | 261 ms ± 3.5 ms | 77.5 ms ± 41.5 µs | 2.5 s ± 97.6 ms
4096 x 4096 | 2.37 s ± 154 ms | 636 ms ± 2.97 µs | 25.9 s ± 394 ms

ezyang · 2018-09-04T16:21:25Z

@pytorchbot retest this please

ezyang · 2018-09-04T16:47:06Z

@pytorchbot retest this please

ezyang · 2018-09-04T16:56:58Z

@pytorchbot retest this please

erikbrinkman · 2018-09-04T17:26:47Z

@pytorchbot retest this please

tools/amd_build/pyHIPIFY/hipify-python.py

Hipify had a bug: when the second CUDA kernel argument was at the end of the kernel string, it set the end of that argument one character too soon. This fixes the bug by adding one to the end position if the current character is not a comma, i.e. if the scan has reached the end of the kernel string.
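The off-by-one described above can be illustrated with a small standalone sketch. This is not the code in `hipify-python.py`; the function `second_arg_span` and its input format (the text between `<<<` and `>>>` of a kernel launch) are assumptions made for illustration only.

```python
def second_arg_span(config):
    """Return (start, end) of the second top-level argument in `config`.

    Hypothetical helper. `config` is assumed to be the text between
    `<<<` and `>>>`, e.g. "grid, block" or "grid, block, 0, stream".
    The span is half-open: config[start:end] is the argument text.
    """
    depth = 0
    start = None
    last = len(config) - 1
    for i, ch in enumerate(config):
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        at_comma = ch == "," and depth == 0
        if at_comma and start is None:
            # The second argument begins right after the first comma.
            start = i + 1
        elif start is not None and (at_comma or i == last):
            # A terminating comma is excluded from the span. If the scan
            # stopped on the last character instead, that character still
            # belongs to the argument, so the end moves one past it.
            # Using `end = i` in both cases reproduces the bug.
            end = i if at_comma else i + 1
            return start, end
    raise ValueError("expected at least two arguments")
```

For `"grid, block"` the scan stops on the final `k`, so the end must be one past it; for `"grid, block, 0, stream"` it stops on a comma and the end is the comma's own index.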

colesbury

Looks good, just a few comments about code style.

In general, we only label pointer and reference types as const.

aten/src/ATen/native/cuda/DistanceKernel.cu

+
+namespace {
+
+static const int warp_size = 32;


aten/src/ATen/native/cuda/DistanceKernel.cu

+  }
+
+  // Reduce warps
+  for (int offset = warp_size / 2; offset > 0; offset /= 2) {
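The loop header above halves `offset` from `warp_size / 2` down to 1, which is the usual shape of a tree reduction across the lanes of a warp. The loop body is not shown in this excerpt, so the following Python simulation is a sketch under the assumption that each lane combines its value with the value held by the lane `offset` positions higher (shuffle-down style). The names `warp_reduce` and `combine` are hypothetical.

```python
import operator

WARP_SIZE = 32  # mirrors `warp_size` in the snippet above


def warp_reduce(lane_values, combine=operator.add):
    """Simulate a shuffle-down tree reduction over one warp.

    Hypothetical sketch: `lane_values` holds one value per lane.
    After log2(WARP_SIZE) steps, lane 0 holds the combined result;
    the values left in the other lanes are not meaningful.
    """
    vals = list(lane_values)
    if len(vals) != WARP_SIZE:
        raise ValueError("expected exactly one value per lane")
    offset = WARP_SIZE // 2
    while offset > 0:
        # All lanes read their partner's old value at the same time,
        # so build the new list from the old one in a single step.
        vals = [
            combine(vals[i], vals[i + offset]) if i + offset < WARP_SIZE else vals[i]
            for i in range(WARP_SIZE)
        ]
        offset //= 2
    return vals[0]
```

Passing `combine=max` gives the same pattern for a maximum, which is what an infinity-norm distance would need instead of a sum.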


aten/src/ATen/native/cuda/DistanceKernel.cu

+  const int vert_per_block = 4;
+  const int horiz_blocks = (m + horiz_per_block * 8 - 1) / (horiz_per_block * 8);
+  const int vert_blocks = (dist.numel() + vert_per_block - 1) / vert_per_block;
+  const dim3 blocks(horiz_blocks, vert_blocks);
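The two block counts above use the integer ceiling-division idiom `(a + b - 1) / b`, so a partially filled block is still launched. A small Python sketch of the same arithmetic follows. The helper names are hypothetical, the value of `horiz_per_block` is not visible in this excerpt, and reading the factor 8 as "elements handled per thread" is a guess, as is treating `m` as the vector length.

```python
def ceil_div(a, b):
    """Smallest integer k such that k * b >= a, for a >= 0 and b > 0."""
    return (a + b - 1) // b


def launch_grid(m, num_pairs, horiz_per_block, vert_per_block=4, elems_per_thread=8):
    """Mirror the grid-size arithmetic from the snippet above.

    Hypothetical sketch: `m` is assumed to be the vector length and
    `num_pairs` stands in for dist.numel(). `elems_per_thread` is a
    guessed name for the literal 8 in the original expression.
    """
    horiz_blocks = ceil_div(m, horiz_per_block * elems_per_thread)
    vert_blocks = ceil_div(num_pairs, vert_per_block)
    return horiz_blocks, vert_blocks
```

Note that the horizontal count never drops below 1 for a non-empty input, even when `m` is much smaller than `horiz_per_block * 8`.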


erikbrinkman · 2018-09-06T18:26:11Z

@colesbury When you talk about const, do you mean the const scalar_t declarations? I find them useful because they tell me I'm not modifying the value. Do you want me to remove them, or is there a different const object I should change?

erikbrinkman · 2018-09-07T00:00:51Z

@pytorchbot retest this please

facebook-github-bot

erikbrinkman is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Summary:
Add the gpu kernel version.

The parallelism I went with performs poorly when there are a large number of vectors, but they're all short, as I don't allocate the thread pool to wrap in that case.

Test Plan
---------
```
python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist
```

Current performance specs are a little underwhelming, I'm in the process of debugging.

size | torch | torch cuda | scipy
-----|-------|------------|------
16 x 16 | 9.13 µs ± 3.55 µs | 9.86 µs ± 81.5 ns | 15.8 µs ± 1.2 µs
16 x 1024 | 15 µs ± 224 ns | 9.48 µs ± 88.7 ns | 88.7 µs ± 8.83 µs
1024 x 16 | 852 µs ± 6.03 µs | 7.84 ms ± 6.22 µs | 4.7 ms ± 166 µs
1024 x 1024 | 34.1 ms ± 803 µs | 11.5 ms ± 6.24 µs | 273 ms ± 6.7 ms
2048 x 2048 | 261 ms ± 3.5 ms | 77.5 ms ± 41.5 µs | 2.5 s ± 97.6 ms
4096 x 4096 | 2.37 s ± 154 ms | 636 ms ± 2.97 µs | 25.9 s ± 394 ms

Pull Request resolved: pytorch/pytorch#11102

Differential Revision: D9697305

Pulled By: erikbrinkman

fbshipit-source-id: 2b4f4b816c02b3715a85d8db3f4e77479d19bb99

erikbrinkman requested review from apaszke, colesbury, ezyang, gchanan, soumith and zdevito as code owners August 30, 2018 19:11

vadimkantorov mentioned this pull request Aug 30, 2018

[pytorch] [feature request] Pairwise distances between all points in a set (a true pdist) #9406

Closed

erikbrinkman force-pushed the master branch 3 times, most recently from 86d4cd1 to 6d259ad Compare September 4, 2018 15:54

erikbrinkman force-pushed the master branch from 6d259ad to 1ae22cf Compare September 4, 2018 16:40

erikbrinkman force-pushed the master branch from 1ae22cf to bb34301 Compare September 4, 2018 17:04

erikbrinkman force-pushed the master branch from 9cffdb7 to 80d6fb9 Compare September 4, 2018 19:44

erikbrinkman added 5 commits September 4, 2018 14:54

Add GPU implementation of pdist

d7212fb

Test Plan: python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist

Switch to kernel dispatch in hopes to fix build errors

e2f94b6

Limit cuda tests

0ef113b

Refactor to fix build errors

ee7a6e5

Fix windows / ROCM build issues

d1d994d

erikbrinkman force-pushed the master branch from 80d6fb9 to d103f87 Compare September 5, 2018 00:22

ezyang reviewed Sep 6, 2018

tools/amd_build/pyHIPIFY/hipify-python.py (outdated, resolved)

erikbrinkman added 4 commits September 5, 2018 22:59

Attempt to fix ROCM sqrt

ef26b43

Another attempted fix for rocm

5e60a1a

Ignore tests on ROCM because ROCM

27dcdb8

erikbrinkman force-pushed the master branch from c92f6cd to 27dcdb8 Compare September 6, 2018 06:01

colesbury approved these changes Sep 6, 2018

Address PR style

32aafbf

facebook-github-bot reviewed Sep 7, 2018

facebook-github-bot closed this in 91089a7 Sep 7, 2018

ezyang added the merged label Jun 26, 2019
