max_pool2d cuda should have channel last optimized kernels [Performance improvement] #24872
Conversation
This reverts commit c7ece81.
Two quick things on the benchmarking:
I'm quite occupied until tomorrow afternoon. Let me take a closer look then.
Oops, my bad, I messed up the legend on backwards earlier :/ Inferring from the perf, we must have already gotten rid of the permutation? Looks like we are doing well except for certain problem sizes where we are getting destroyed by the stock kernel. Let me take another pass tomorrow afternoon.
Yes, I think we can improve even more here, but the idea for now is to have a stable version that is not worse than NCHW.
```cpp
static const int BACKWARD_THREADS = 256;

template <typename scalar_t, typename accscalar_t>
C10_LAUNCH_BOUNDS_1(CUDA_MAX_THREADS)
__global__ void MaxPoolForwardNHWC(const int nthreads, const scalar_t* bottom_data,
```
nthreads and num are not used any more. Same with MaxPoolBackwardNHWC.
Fixed
```cpp
for(int c = threadIdx.x; c < channels; c+= blockDim.x) {
  scalar_t val = ptr_input[c];
  scalar_t maxval = out_cached[2 * c];
  if ((ScalarConvert<scalar_t, accscalar_t>::to(val) > maxval) || THCNumerics<scalar_t>::isnan(val)) {
```
pytorch/aten/src/THC/THCNumerics.cuh (lines 403 to 407 in b2f6e2b):
```cpp
// DEPRECATED: use static_cast in kernels instead of scalar_cast
template <typename T, typename U>
__host__ __device__ T scalar_cast(U u) {
  return ScalarConvert<U, T>::to(u);
}
```
We might just use static_cast
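A minimal sketch of that substitution, assuming the same types and names as the snippet above (the `update_max` helper is hypothetical, added only to make the fragment self-contained):

```cpp
// Hypothetical helper illustrating the suggestion: replace the deprecated
// ScalarConvert<scalar_t, accscalar_t>::to(val) with a plain static_cast.
template <typename scalar_t, typename accscalar_t>
__device__ __forceinline__ void update_max(accscalar_t* out_cached, int c, scalar_t val) {
  accscalar_t maxval = out_cached[2 * c];
  if ((static_cast<accscalar_t>(val) > maxval) || THCNumerics<scalar_t>::isnan(val)) {
    out_cached[2 * c] = static_cast<accscalar_t>(val);
  }
}
```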
Fixed
```cpp
const scalar_t *ptr_input = bottom_data + ih * in_stride_h + iw * in_stride_w;
for(int c = threadIdx.x; c < channels; c+= blockDim.x) {
  scalar_t val = ptr_input[c];
  scalar_t maxval = out_cached[2 * c];
```
NIT: maxval doesn't seem to be necessary here.
Fixed
```cpp
scalar_t maxval = out_cached[2 * c];
if ((ScalarConvert<scalar_t, accscalar_t>::to(val) > maxval) || THCNumerics<scalar_t>::isnan(val)) {
  out_cached[2 * c] = ScalarConvert<scalar_t, accscalar_t>::to(val);
  out_cached[2 * c + 1] = ih * width + iw;
```
This is dangerous: we should not convert the index to scalar_t.
For the fp16 kernel, the mantissa is only 10 bits, so we only have a range of 1023 (to be fair, on both sides, but the index is only going to be positive). Any index beyond that will give us an error here.
We need to change the earlier line to something like:
```cpp
extern __shared__ int smem[];
int *out_mask_cached = smem;
scalar_t *out_cached = reinterpret_cast<scalar_t*>(&out_mask_cached[channels*height*width]);
```
This will also require updating the allocation of dynamic shared memory in the launch config.
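For illustration, the launch-side allocation for that two-region layout could be sized roughly as below. The kernel name comes from the snippet above; the grid/block/stream variables and the `channels * height * width` element count mirror the suggested layout and are assumptions, not the PR's exact code:

```cpp
// Sketch only: dynamic shared memory must now cover both the int index region
// and the scalar_t value region carved out by the reinterpret_cast above.
size_t shmem_bytes = channels * height * width * (sizeof(int) + sizeof(scalar_t));
MaxPoolForwardNHWC<scalar_t, accscalar_t>
    <<<grid, block, shmem_bytes, at::cuda::getCurrentCUDAStream()>>>(
        /* kernel arguments elided */);
```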
Fixed
```cpp
for (int ih = istartH; ih < iendH; ih+=blockDim.z) {
  for (int iw = istartW; iw < iendW; iw+=blockDim.y) {
    int phstart, phend, pwstart, pwend;
    if (stride_h == 1) {
```
NIT: Looks like we are using the same logic as the NCHW kernel here. Maybe we want to combine the code and avoid keeping two copies of the same logic, so that a later cleanup would be easier.
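For illustration, the shared piece could be a small device helper along these lines; the helper name and the exact formula are assumptions based on the usual Caffe-style pooling backward index math, not the PR's actual code:

```cpp
// Hypothetical helper: for an input index along one spatial dimension, compute
// the half-open range [pstart, pend) of pooled outputs whose window covers it.
__device__ __forceinline__ void pooling_output_range(
    int index, int pad, int kernel, int stride, int pooled_size,
    int& pstart, int& pend) {
  pstart = (index + pad < kernel) ? 0 : (index + pad - kernel) / stride + 1;
  pend = min((index + pad) / stride + 1, pooled_size);
}
```

Both the NCHW and NHWC backward kernels could then call it once for the height dimension and once for the width dimension instead of duplicating the branches.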
Fixed
facebook-github-bot
left a comment
@ifedan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
```cpp
      output_data, indices_data);
  break;
}
default: AT_ERROR("Unsupported memory format. Supports only ChannelsLast, Contiguous");
```
```cpp
// Deprecated alias; this alias was deprecated because it represents extra API
// surface that makes it hard for people to understand what macro to use.
// Use TORCH_CHECK(false, ...) or TORCH_INTERNAL_ASSERT(false, ...) to
// unconditionally fail at a line of code.
#define AT_ERROR(...)                                                        \
  do {                                                                       \
    ::c10::detail::deprecated_AT_ERROR();                                    \
    C10_EXPAND_MSVC_WORKAROUND(TORCH_CHECK(false, ::c10::str(__VA_ARGS__))); \
  } while (false)
```
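Following that guidance, the suggested replacement for the default branch would look roughly like:

```cpp
default:
  TORCH_CHECK(false, "Unsupported memory format. Supports only ChannelsLast, Contiguous");
```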
Fixed
VitalyFedyunin
left a comment
LGTM! Not approving to avoid an accidental merge (we want to benchmark everything first in a separate branch).
Please rebase; we are getting ready to land it.
VitalyFedyunin
left a comment
Please add a double backward test.
```cpp
      gradInput_data);
  break;
}
default: AT_ERROR("Unsupported memory format. Supports only ChannelsLast, Contiguous");
```
TORCH_CHECK, please.
Fixed
```cpp
auto indices_view = indices.view(size);
return grad.contiguous().view(size).gather(-1, indices_view).view(indices.sizes());
const auto memory_format = indices.suggest_memory_format();
return grad.contiguous(memory_format).view(size).gather(-1, indices_view).view(indices.sizes());
```
We need to add a test for double backward.
It will be checked through this magic parameter: check_with_channels_last=True.
VitalyFedyunin
left a comment
Feel free to land on Monday as soon as all tests are green.
…e improvement] (#24872) Summary: max_pool2d_with_indices_cuda and max_pool2d_with_indices_backward_cuda should have channel last optimized kernels(pytorch/pytorch#23815) Pull Request resolved: pytorch/pytorch#24872 Differential Revision: D16964577 Pulled By: ifedan fbshipit-source-id: 296dfef8e511a7ae2ed423e34e902d5401b3becb




max_pool2d_with_indices_cuda and max_pool2d_with_indices_backward_cuda should have channel last optimized kernels (#23815)