Conversation

@killeent (Contributor) commented Oct 10, 2017

Partially addresses #1708. Currently works for Conv2D but needs testing.

TODO:

  • Extend gradWeight kernel to handle padding, strides, and dilation
  • Extend all kernels to handle a depthwise multiplier, i.e. a depthwise convolution with multiple filters per input channel (see the sketch after this list)
  • Add support for bias
  • Make gradWeight (and other kernels) more efficient by leveraging better CUDA programming paradigms
  • Benchmark performance on individual layers and models that leverage Depthwise Convolution, see if any other perf wins can be made
  • Finish writing unit tests
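
For reference, here is how the depthwise convolution (with a multiplier) that these kernels target looks at the nn.Conv2d level. This is a minimal illustration, not code from this PR, and the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

in_channels, multiplier = 32, 2
depthwise = nn.Conv2d(
    in_channels,
    in_channels * multiplier,  # depthwise multiplier: 2 filters per input channel
    kernel_size=3,
    stride=2,
    padding=1,
    dilation=1,
    groups=in_channels,        # groups == in_channels is what makes the conv depthwise
    bias=True,
)

x = torch.randn(8, in_channels, 112, 112)
print(depthwise(x).shape)      # torch.Size([8, 64, 56, 56])
```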

@killeent changed the title from "[WIP] 2D Depthwise Convolution on the GPU" to "[WIP] Spatial Depthwise Convolution on the GPU" on Oct 10, 2017
@killeent (Contributor, Author) commented:

Some preliminary results (the layer and input parameters are taken from MobileNet). All times are for 50 iterations of forward/backward; a sketch of the timing setup follows the table. The trends:

  • As the number of channels (and thus groups) increases, the new code becomes relatively faster, because a per-group loop over an ever-larger number of groups is replaced with a single kernel call
  • As the batch size increases, these gains are dampened, but remain non-trivial
| Batch Size | Input Channels | Height | Width | kH | kW | Stride | time (old) | time (new) | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 32 | 112 | 112 | 1 | 1 | 1 | 0.2129 | 0.0076 | 28x |
| 1 | 64 | 112 | 112 | 3 | 3 | 2 | 0.1937 | 0.0143 | 13.5x |
| 1 | 128 | 56 | 56 | 3 | 3 | 1 | 0.4229 | 0.0241 | 17.5x |
| 1 | 128 | 56 | 56 | 3 | 3 | 2 | 0.3264 | 0.0112 | 29x |
| 1 | 256 | 28 | 28 | 3 | 3 | 1 | 0.7379 | 0.0207 | 35.5x |
| 1 | 256 | 28 | 28 | 3 | 3 | 2 | 0.7522 | 0.0059 | 127x |
| 1 | 512 | 14 | 14 | 3 | 3 | 1 | 1.406 | 0.0097 | 145x |
| 1 | 512 | 14 | 14 | 3 | 3 | 2 | 1.494 | 0.0052 | 287x |
| 1 | 1024 | 7 | 7 | 3 | 3 | 2 | 2.919 | 0.0053 | 550x |
| 64 | 32 | 112 | 112 | 1 | 1 | 1 | 3.603 | 0.3102 | 11.5x |
| 64 | 64 | 112 | 112 | 3 | 3 | 2 | 2.261 | 0.5031 | 4.5x |
| 64 | 128 | 56 | 56 | 3 | 3 | 1 | 4.343 | 0.9285 | 4.5x |
| 64 | 128 | 56 | 56 | 3 | 3 | 2 | 1.445 | 0.2310 | 6x |
| 64 | 256 | 28 | 28 | 3 | 3 | 1 | 3.586 | 0.4187 | 8.5x |
| 64 | 256 | 28 | 28 | 3 | 3 | 2 | 1.043 | 0.1168 | 9x |
| 64 | 512 | 14 | 14 | 3 | 3 | 1 | 2.489 | 0.1957 | 12.5x |
| 64 | 512 | 14 | 14 | 3 | 3 | 2 | 1.283 | 0.0676 | 18.5x |
| 64 | 1024 | 7 | 7 | 3 | 3 | 2 | 2.238 | 0.0686 | 32.5x |
| 128 | 32 | 112 | 112 | 1 | 1 | 1 | 7.008 | 0.5918 | 11.5x |
| 128 | 64 | 112 | 112 | 3 | 3 | 2 | 3.825 | 0.9979 | 3.5x |
| 128 | 128 | 56 | 56 | 3 | 3 | 1 | 8.922 | 1.844 | 4.5x |
| 128 | 128 | 56 | 56 | 3 | 3 | 2 | 2.419 | 0.4574 | 5x |
| 128 | 256 | 28 | 28 | 3 | 3 | 1 | 9.471 | 0.8292 | 11x |
| 128 | 256 | 28 | 28 | 3 | 3 | 2 | 1.352 | 0.2224 | 6x |
| 128 | 512 | 14 | 14 | 3 | 3 | 1 | 7.997 | 0.3735 | 21.5x |
| 128 | 512 | 14 | 14 | 3 | 3 | 2 | 1.371 | 0.1128 | 12x |
| 128 | 1024 | 7 | 7 | 3 | 3 | 2 | 2.383 | 0.0998 | 24x |
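
A minimal sketch of the timing methodology described above (this is not the author's benchmark script; it assumes a CUDA device, picks one configuration from the table, and uses torch.cuda.synchronize for accurate GPU timing):

```python
import time
import torch
import torch.nn as nn

def time_layer(conv, x, iters=50):
    """Time `iters` forward/backward passes of `conv` on `x`, in seconds."""
    x = x.clone().requires_grad_(True)
    conv(x).sum().backward()        # warm-up pass, excluded from the measurement
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        conv.zero_grad()
        conv(x).sum().backward()
    torch.cuda.synchronize()
    return time.time() - start

# e.g. batch 64, 128 channels, 56x56 input, 3x3 kernel, stride 2
x = torch.randn(64, 128, 56, 56, device="cuda")
depthwise = nn.Conv2d(128, 128, 3, stride=2, padding=1, groups=128).cuda()
print(time_layer(depthwise, x))
```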

@killeent changed the title from "[WIP] Spatial Depthwise Convolution on the GPU" to "Spatial Depthwise Convolution on the GPU" on Oct 13, 2017
@killeent (Contributor, Author) commented:

Marking this as ready to review. I still need to write tests, but while I do that I think it's worth looking at the kernels, and also the integration, which is a little messy.


@killeent (Contributor, Author) commented:

Okay, I addressed everything except:

  • Half-precision unit tests; I will need @colesbury to add support for binding THCUNN half operations in ATen, or do that myself
  • I made the kernels use multiplication instead of division and accumulation types where appropriate, but did not address any other performance work. If you would like further perf improvements, I would rather do those in a separate PR

auto padding = vecToInt64(this->padding);
auto dilation = vecToInt64(this->dilation);

at::conv_depthwise2d_forward_out(output, input, weight, kernel_size, bias, stride, padding, dilation);

if (output_mask[2]) {
  grad_bias = bias.type().tensor();
  grad_bias.resize_as_(bias).zero_();
  update_grad_bias(grad_output, grad_bias);
}

value = THCNumerics<AccT>::add(
    value,
    ScalarConvert<T, AccT>::to(
        THCNumerics<T>::mul(weight.data()[weightOffset], input.data()[offset])));
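
The excerpt above converts each half-precision product to the accumulation type AccT before adding it into the running value. The Python sketch below is an illustration only (unrelated to the PR's code) of the failure mode that a wider accumulation type avoids:

```python
import torch

# fp16 has a 10-bit mantissa, so consecutive integers above 2048 are not representable:
# once the running sum reaches 2048, adding 1.0 in fp16 rounds away to nothing.
one = torch.tensor(1.0, dtype=torch.float16)
acc_half = torch.tensor(0.0, dtype=torch.float16)   # accumulate in the input type
acc_float = torch.tensor(0.0, dtype=torch.float32)  # accumulate in a wider type ("AccT")

for _ in range(4096):
    acc_half = acc_half + one
    acc_float = acc_float + one

print(acc_half.item())          # 2048.0 -- the fp16 sum stalls
print(acc_float.half().item())  # 4096.0 -- accumulate in fp32, convert once at the end
```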


@killeent (Contributor, Author) commented:

Okay, I added support for Half-Precision. Let me know if there is anything else I need to do.

@ngimel (Collaborator) commented Oct 16, 2017

LGTM. Did you figure out support for binding THCUNN half operations in ATen?

@killeent (Contributor, Author) commented:

@ngimel - yeah, we just needed to add an extra parameter to @colesbury's nn_parse script - ATen itself already had support for handling half.

@ngimel (Collaborator) commented Oct 16, 2017

Cool! Will it also fix #2435? It is still broken.

@killeent (Contributor, Author) commented:

@ngimel I ran that repro script and it didn't crash, so I think so.

@KeCh96 commented Feb 2, 2018

I have upgraded my PyTorch to 0.3.0, but I found that m = nn.Conv2d(128, 256, kernel_size=3, groups=128) is still about 2x slower than m = nn.Conv2d(128, 256, kernel_size=3). I am really confused by this; do I need to upgrade PyTorch to another version, or use a different operation?

@cddlyf commented Feb 6, 2018

I have tried depthwise convolution with nn.Conv2d(64, 64, 3, 1, 1, groups=64), but it is only around 2x faster than nn.Conv2d(64, 64, 3, 1, 1). The input size is 1x64x256x256; could you tell me what's wrong? My PyTorch version is 0.3.0.post4.
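
For scale, the two layers being compared perform very different amounts of arithmetic, so the wall-clock ratio on its own is hard to interpret. A back-of-the-envelope multiply-accumulate count (illustrative arithmetic only, not part of the thread; it assumes the padding of 1 keeps the spatial size unchanged):

```python
# MAC counts for nn.Conv2d(64, 64, 3, 1, 1) vs. the same layer with groups=64,
# on a 1x64x256x256 input (stride 1, padding 1, so H and W are preserved).
N, C, H, W = 1, 64, 256, 256
kH = kW = 3

dense_macs = N * C * H * W * C * kH * kW   # each output channel reads all 64 input channels
depthwise_macs = N * C * H * W * kH * kW   # groups=64: each output channel reads one input channel

print(dense_macs // depthwise_macs)        # 64: the depthwise layer does 64x fewer MACs, so the
                                           # observed speedup also depends on how well each kernel
                                           # keeps the GPU busy, not on the MAC ratio alone
```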

@cddlyf commented Feb 7, 2018

@killeent @ngimel could you give me some hints? How do I use the optimized depthwise convolution? Does it require the latest PyTorch or cuDNN?

@ouceduxzk commented:

@cddlyf me too; I found that most existing separable_conv code still uses conv with groups.
