
Conversation

@thorjohnsen
Contributor

Adds persistent CUDA kernels that speed up SoftMax applied over the fast dimension, i.e. torch.nn.Softmax(dim=-1) and torch.nn.LogSoftmax(dim=-1). When the size of the fast dimension is <= 1024, this code is 2-10x faster than the current code; the speedup is higher for smaller sizes. The kernels work for half, float and double tensors with 1024 or fewer elements in the fast dimension. Numerical accuracy is on par with the current code: relative error, computed against the CPU implementation, is ~1e-8 for float tensors and ~1e-17 for double tensors.

The attached image shows kernel time in µs for torch.nn.Softmax(dim=-1) applied to a half precision tensor of shape [16384, n], with n plotted along the horizontal axis. Similar uplifts can be seen for the backward pass and for LogSoftmax.

[Image: plot of kernel time in µs vs. n for the benchmark described above]
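For concreteness, a minimal libtorch sketch of the call being benchmarked (not part of the PR; kernel times would be collected separately with a profiler):

```cpp
#include <torch/torch.h>
#include <cuda_runtime.h>

int main() {
  const int64_t n = 256;  // swept along the horizontal axis in the plot
  // Half-precision tensor of shape [16384, n] on the GPU.
  auto x = torch::randn({16384, n},
                        torch::dtype(torch::kHalf).device(torch::kCUDA));
  // Softmax / LogSoftmax over the fast (last) dimension -- the path the
  // new persistent kernels accelerate for sizes <= 1024.
  auto y  = torch::softmax(x, /*dim=*/-1);
  auto ly = torch::log_softmax(x, /*dim=*/-1);
  cudaDeviceSynchronize();  // make sure the kernels have finished
  return 0;
}
```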

@pytorchbot added the "module: cuda" and "module: operators" labels on May 22, 2019
@soumith
Contributor

soumith commented May 23, 2019

this is really cool!

@ngimel
Collaborator

ngimel commented May 23, 2019

Test failures are real.

@thorjohnsen
Contributor Author

Yeah. They are intermittent. I ran one of the failed tests (test_softmax_dtype) two times in a row. It failed the first time and passed the second time. I'm looking into it.

@thorjohnsen
Contributor Author

I'm pretty sure I introduced this bug during the code cleanup: the input arrays were no longer being properly initialized. I fixed it; hopefully the tests will pass now.

@ngimel (Collaborator) left a comment

Some small changes, overall looks good.

}
}

constexpr uint32_t FULL_MASK = 0xffffffff;
Collaborator:

You don't need to define FULL_MASK; WARP_SHFL_XOR uses the full mask by default.

Contributor Author:

I will remove FULL_MASK.

// Warp Softmax forward
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

// WARP_BATCH number of batches.
Collaborator:

This is a misleading comment - is it the number of samples?

Contributor Author:

I kind of inherited the comments. I agree they are not very clear; I'll try to improve that.


// WARP_BATCH number of batches.
// WARP_ITERATIONS The number of iterations required for one warp to iterate over all data.
// WARP_SIZE number of elements working on a single batch, has to be a power of two.
Collaborator:

number of threads working on a single sample?

for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
    acc_t val[WARP_BATCH];
    #pragma unroll
    for (int i = 0; i < WARP_BATCH; ++i) {
Collaborator:

Do the 2 loops here provide perf benefits, or can they be fused like this?

val = WARP_SHFL_XOR(..)
max_value[i] = <ternary operator>

If 2 loops are indeed necessary, a comment might be in order.
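Spelled out, the fused form would look something like this (a sketch using max_value, WARP_BATCH, WARP_SIZE and acc_t from the kernel above; not taken from the final code):

```cuda
#pragma unroll
for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
    #pragma unroll
    for (int i = 0; i < WARP_BATCH; ++i) {
        // shuffle in the partner lane's running max and combine in one step
        acc_t val = WARP_SHFL_XOR(max_value[i], offset, WARP_SIZE);
        max_value[i] = (max_value[i] > val) ? max_value[i] : val;
    }
}
```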


// reduction sum
#pragma unroll
for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
Collaborator:

I think it would make sense to have a __device__ function for warp reduce, as it is used a few times. It could also handle max reduction.
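A sketch of what such a helper could look like (the names warp_reduce, Add and Max are illustrative, not taken from the PR; WARP_SHFL_XOR is the existing shuffle wrapper):

```cuda
// Combine functors so one reduction routine covers both the sum and max passes.
template <typename acc_t>
struct Add { __device__ __forceinline__ acc_t operator()(acc_t a, acc_t b) const { return a + b; } };

template <typename acc_t>
struct Max { __device__ __forceinline__ acc_t operator()(acc_t a, acc_t b) const { return a < b ? b : a; } };

// Butterfly reduction across the warp for WARP_BATCH independent samples.
template <typename acc_t, int WARP_BATCH, int WARP_SIZE, template <typename> class ReduceOp>
__device__ __forceinline__ void warp_reduce(acc_t* sum) {
    ReduceOp<acc_t> r;
    #pragma unroll
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
        #pragma unroll
        for (int i = 0; i < WARP_BATCH; ++i) {
            acc_t b = WARP_SHFL_XOR(sum[i], offset, WARP_SIZE);
            sum[i] = r(sum[i], b);
        }
    }
}
```

The max reduction above would then become warp_reduce<acc_t, WARP_BATCH, WARP_SIZE, Max>(max_value); and the sum reduction warp_reduce<acc_t, WARP_BATCH, WARP_SIZE, Add>(sum);.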

kernel<<<blocks, threads>>>(dst, src, batch_count, softmax_elements_stride, softmax_elements);
return true;
}
return false;
Collaborator:

Might make sense to do a TORCH_INTERNAL_ASSERT on softmax_elements <= 1024.

Contributor Author:

That's actually a better idea than asserting on the return value from dispatch_softmax. Will do.
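Something along these lines in the host-side dispatch (exact message and placement are illustrative):

```cuda
TORCH_INTERNAL_ASSERT(softmax_elements >= 0 && softmax_elements <= 1024,
    "dispatch_softmax only supports softmax_elements <= 1024, got ",
    softmax_elements);
```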

} else if (softmax_elements <= 1024) {
    // compute function index. there's a function for each power of two size up to 1024.
    int log2_elements = 0;
    while ((1 << log2_elements) < softmax_elements) ++log2_elements;
Collaborator:

@ezyang, @mcarilli the functions for next power of 2 are often needed - does it make sense to make those helper functions generally available?
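For reference, the computation shown above factored into a reusable helper (the name log2_ceil is illustrative):

```cuda
// Smallest exponent e such that (1 << e) >= value, i.e. log2 of the next power of two.
__host__ __device__ __forceinline__ int log2_ceil(int value) {
    int log2_value = 0;
    while ((1 << log2_value) < value) ++log2_value;
    return log2_value;
}

// In the dispatch above this would read:
//   int log2_elements = log2_ceil(softmax_elements);
//   int next_power_of_two = 1 << log2_elements;
```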

}

// reduction sum
constexpr uint32_t FULL_MASK = 0xffffffff;
Collaborator:

same comment about FULL_MASK

}

// use 128 threads per block to maximize gpu utilization
constexpr int threads_per_block = 128;
Collaborator:

It looks like all these computations are shared between forward/backward, so maybe they can be abstracted away?

Contributor Author:

Ok.
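A hedged sketch of what factoring the shared launch-geometry computation out of the forward and backward dispatchers might look like (the helper name and the batches_per_warp heuristic are assumptions, not taken from the PR):

```cuda
struct SoftmaxLaunchConfig {
    dim3 threads;
    int blocks;
};

// Derive block/thread shape from the padded softmax size and the batch count.
inline SoftmaxLaunchConfig softmax_launch_config(int next_power_of_two, int batch_count) {
    constexpr int threads_per_block = 128;  // use 128 threads per block to maximize gpu utilization
    int warp_size = next_power_of_two < 32 ? next_power_of_two : 32;
    int batches_per_warp = next_power_of_two <= 128 ? 2 : 1;  // assumed heuristic
    int warps_per_block = threads_per_block / warp_size;
    int batches_per_block = warps_per_block * batches_per_warp;
    int blocks = (batch_count + batches_per_block - 1) / batches_per_block;
    return {dim3(warp_size, warps_per_block, 1), blocks};
}
```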


Tensor log_softmax_cuda(const Tensor &input, const int64_t dim, const bool half_to_float){
-  return host_softmax<LogSoftMaxForwardEpilogue>(input, dim, half_to_float);
+  return host_softmax<LogSoftMaxForwardEpilogue,true>(input, dim, half_to_float);
Collaborator:

I'm not too happy with 2 template arguments saying the same thing (I'm log! I'm not!), but since I did not come up with a not too ugly way to get rid of it, I'll let it slide.

@thorjohnsen
Contributor Author

One of the softmax tests is still failing on accuracy, but it looks like that is due to a kludge in PyTorch. The test that fails runs softmax on a double precision tensor but sees float-like accuracy (i.e. relative error ~1e-8). The test was compiled with -D__HIP_PLATFORM_HCC__=1. My code uses WARP_SHFL_XOR for intra-warp reductions; if you look at the place where it is defined, there is a comment saying "HIP does not support double" and a specialization of WARP_SHFL_XOR that casts the value to a float before calling __shfl_xor. Because of this, no test that uses WARP_SHFL_XOR will ever pass the double precision accuracy tests in test_nn.py @iotamudelta.

#ifdef __HIP_PLATFORM_HCC__
//To handle ambiguity, add a type double version.
__device__ __forceinline__ double WARP_SHFL_XOR(double value, int laneMask, int width = warpSize, unsigned int mask = 0xffffffff) {
    //(HIP doesn't support double)
    return (double) __shfl_xor((float) value, laneMask, width);
}
#endif

@iotamudelta
Contributor

@thorjohnsen A double __shfl_xor(double, int, int) is defined for ROCm. This specialization should be removable.
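With the specialization gone, the generic path handles double directly. A sketch of the shuffle wrapper as it would then read (assuming the existing templated WARP_SHFL_XOR in THCDeviceUtils.cuh; reconstructed from memory, not quoted from the PR):

```cuda
template <typename T>
__device__ __forceinline__ T WARP_SHFL_XOR(T value, int laneMask,
                                           int width = warpSize,
                                           unsigned int mask = 0xffffffff) {
#ifndef __HIP_PLATFORM_HCC__
  // CUDA: the sync variant takes an explicit lane mask.
  return __shfl_xor_sync(mask, value, laneMask, width);
#else
  // ROCm: __shfl_xor has a double overload, so no float cast is needed.
  return __shfl_xor(value, laneMask, width);
#endif
}
```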

@thorjohnsen
Contributor Author

71 tests passed, 4 failed. After inspection, I don't think any of the 4 failed tests are related to this PR, but please don't take my word for it. Removing the double specialization of WARP_SHFL_XOR as you suggested fixed the failing Softmax test, and a couple of other unrelated unit tests are also passing now, so we might have accidentally fixed a bug @iotamudelta. I will push the new code with your suggested improvements tomorrow morning @ngimel; then I think we should be ready for the merge.

@ngimel
Collaborator

ngimel commented May 29, 2019

My comments are addressed, thank you.

@thorjohnsen
Contributor Author

The 3 failed tests appear unrelated to this PR. One test failed with an ImportError for torch.onnx.symbolic_helper, the second failed with ImportError: undefined symbol: _ZN2at19NonVariableTypeMode10is_enabledEv, and the third failed because of a java.io.IOException: Backing channel 'JNLP4-connect connection from 209.249.227.2/209.249.227.2:43242' is disconnected. I believe this PR is ready to be merged.

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request May 31, 2019
Summary: same as the PR description and plot above.
Pull Request resolved: pytorch/pytorch#20827

Differential Revision: D15582509

Pulled By: ezyang

fbshipit-source-id: 65805db37487cebbc4ceefb1a1bd486d24745f80
@facebook-github-bot
Contributor

@ezyang merged this pull request in e098878.
