ptrblck commented Jun 27, 2019

This PR activates faster depthwise convolution kernels for Volta and Turing GPUs using cudnn >= 7600.
The script to benchmark the current PyTorch master branch and this PR branch can be found here: https://gist.github.com/ptrblck/4590cf20721d8f43296c9903abd4a774
(50 warmup iterations, 1000 iterations for timing)

I've used #3265 to create a similar benchmark and added a few additional setups.
Since the results are quite long, I've uploaded them to a spreadsheet here: https://docs.google.com/spreadsheets/d/13ByXcqg7LQUr3DVG3XpLwnJ-CXg3GUZJ3puyTMw9n2I/edit?usp=sharing
Times are given in ms per iteration.
We've benchmarked this PR on a DGX1 using V100 GPUs.

The current workload check in check_cudnn_depthwise_workload is quite long and could be moved to another file if desired.

CC @ngimel (Thanks for the support while benchmarking it ;) )

int w = input.size(3); // same as h
int ch = input.size(1);
int bs = input.size(0);
int k = weight.size(2); // kernel size

Collaborator:
you never use k in this function

ptrblck (Author):
You are right! That's some dead code and I'll remove it.


li-roy commented Jun 28, 2019

@pytorchbot rebase this please


soumith commented Jun 29, 2019

@pytorchbot rebase this please

zhangguanheng66 commented:

@pytorchbot rebase this please

pytorchbot commented:

Sorry, only maintainers are authorized to rebase other people's PRs. Feel free to try again on one of your PRs!

(To learn more about this bot, see Bot commands.)

zhangguanheng66 commented:

@pytorchbot retest this please


ngimel commented Jul 1, 2019

@ptrblck the Windows build failure looks real; apparently it does not like "and".


ptrblck commented Jul 2, 2019

@ngimel Thanks for the information!
I wasn't aware that alternative tokens might cause trouble on Windows.
I've updated it to &&.


ptrblck commented Jul 2, 2019

@pytorchbot retest this please


ngimel commented Jul 3, 2019

@pytorchbot rebase this please

facebook-github-bot left a comment:

@izdeby has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

auto ConvParams::use_cudnn_depthwise(
const at::Tensor& input, const at::Tensor& weight) const -> bool {
#if AT_CUDNN_ENABLED()
cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties();

Collaborator:

you should not be calling getCurrentDeviceProperties() directly here; instead, add CUDAHooks::supportsDepthwiseConvolutionWithCuDNN to cuda/detail/CUDAHooks.cpp, like it's currently done for CUDAHooks::supportsDilatedConvolutionWithCuDNN(). That would also allow you to avoid the AT_CUDNN_ENABLED macro.

pytorchbot added the merge-this-please label Jul 8, 2019

facebook-github-bot left a comment:

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 9, 2019
Summary:
This PR activates faster depthwise convolution kernels for Volta and Turing GPUs using cudnn >= 7600.
The script to benchmark the current PyTorch master branch and this PR branch can be found [here](https://gist.github.com/ptrblck/4590cf20721d8f43296c9903abd4a774).
(50 warmup iterations, 1000 iterations for timing)

I've used pytorch/pytorch#3265 to create a similar benchmark and added a few additional setups.
Since the results are quite long, I've uploaded them in a spreadsheet [here](https://docs.google.com/spreadsheets/d/13ByXcqg7LQUr3DVG3XpLwnJ-CXg3GUZJ3puyTMw9n2I/edit?usp=sharing).
Times are given in ms per iteration.
We've benchmarked this PR on a DGX1 using V100 GPUs.

The current workload check in `check_cudnn_depthwise_workload` is quite long and can be moved to another file, if wanted.

CC ngimel (Thanks for the support while benchmarking it ;) )
Pull Request resolved: pytorch/pytorch#22302

Differential Revision: D16115057

Pulled By: ezyang

fbshipit-source-id: bad184658518e73b4d6b849d77e408f5a7a757de

@ezyang merged this pull request in a3346e1.

ptrblck deleted the cudnn branch July 14, 2019 22:44

yaysummeriscoming commented:

Hi, I'm very excited this is in :). I'm able to reproduce the individual depthwise convolution tests as presented in the spreadsheet: very impressive gains of up to 400%.

I decided to test this with MobileNet V2, but unfortunately I'm only seeing speedups of ~10%. My understanding is that depthwise convolution is the slowest link when training such lightweight networks, so this seems quite low to me?

I’ve modified the test script to use MobileNet V2 here:
https://gist.github.com/yaysummeriscoming/88ae59bc5b7ba8581ea396d8ce87d28f

I had a go at profiling with the autograd profiler, but that doesn't distinguish between pointwise and depthwise convolutions.

Any pointers? Could it be that cuDNN isn’t optimised for pointwise convolutions with large input/output channel ratios?

(I’m assuming this is the right place to ask this, please correct me if not)


jph00 commented Oct 25, 2019

@ptrblck thanks for this great PR and helpful benchmarking. I've created a couple of summary tables of the benchmarks that might be helpful. Here's speedup by kernel size and stride, by height/width:

[image: summary table of speedup by kernel size and stride, by height/width]

And here's the details of h/w by num channels, for just the stride one and kernel size 3 rows:

[image: summary table of h/w by number of channels, for stride 1 and kernel size 3]

Have you tried benchmarking 5x5 convs? They are used a lot in EfficientNet, so it would be great if they're fast...

cc @ngimel


jph00 commented Oct 26, 2019

I just tried 5x5 convs and it appears they are not optimized for tensor cores: they ran at about the same speed in fp16 and fp32.


ngimel commented Oct 26, 2019

IIRC, cudnn only had fast implementations for 1x1 and 3x3, so just enabling it for 5x5 is unlikely to dramatically speed things up.


jph00 commented Oct 26, 2019

Thanks @ngimel . Is there any plan to add a 5x5 implementation? If not, could I twist your arm to create such a plan... ;)


ngimel commented Oct 26, 2019

Not mine, I'm not with nvidia anymore :-)


jph00 commented Oct 26, 2019

Oh yes so I see! Welcome to Facebook then :)

andravin commented:

Hi @ptrblck and @ngimel, just to clarify a point that is causing some confusion: although this patch uses cuDNN kernels for depthwise convolution on Volta, do those kernels actually use tensor cores?

Depthwise convolution is basically planar convolution nested inside of diagonal matrix multiplication. I do not see how that could be made faster with 8m x 8n x 4k matrix multiplication fragments. It seems like one of the input matrices would be diagonal, and (at most) 1/8th of the matrix elements would be nonzero, reducing the effective arithmetic throughput of the tensor cores to the level of fp32 core arithmetic throughput.


ngimel commented Feb 21, 2020

@andravin you are right, those kernels don't use tensor cores.

andravin commented:

@ngimel do they use hfma2? fp16 accumulation might be OK for 3x3 depth-wise.

Otherwise I am stumped why these kernels are Volta only. Also, P100 had hfma2.

facebook-github-bot pushed a commit that referenced this pull request Jun 22, 2020
Summary:
Follow-up of #38044. Thanks ptrblck and mcarilli for the help discussing the changes!

Could fix #37725 by skipping the depthwise-workload check introduced in #22302. This PR also relaxed dilated convolution for channels-last.

The testing script is https://gist.github.com/xwang233/82a707f69bb710cb612349280a2c5f41. About 387k conv arguments were tested and no cudnn exception was thrown.

cc ngimel VitalyFedyunin ptrblck mcarilli
Pull Request resolved: #38904

Differential Revision: D22155797

Pulled By: VitalyFedyunin

fbshipit-source-id: 81b5736cec67ea263029121521c6acafd9dddba6
facebook-github-bot pushed a commit that referenced this pull request Oct 23, 2021
Summary:
There are multiple improvements to depthwise convolution speed in cudnn between 7.6 and 8.2, since #22302.
This PR aims to harvest all of these improvements by enabling more cudnn kernels. The workload-checking logic can also be simplified now.
To keep the change simple, I kept things before cudnn 8.2 unchanged.

Similar to #22302, I used a script [here](https://gist.github.com/FDecaYed/e8ba98a95cd33697df2ace86fdb44897) to benchmark. Both runs are using cudnn 8.2.
One enhancement I made to the script is switching to event-based timing. With warmup kernels filling the launch queue ahead of time, this should give accurate kernel timing even in CPU launch-bound cases.

Here are the A100 and V100 results sorted by speedup.
[Book1.xlsx](https://github.com/pytorch/pytorch/files/6530371/Book1.xlsx)

Result highlights:
Newly enabled 5x5 cudnn kernels show up to 6x speedup.
Close to half of the tested sizes show >10% speedup.
Fixed some corner cases that previously caused 15-20x slowdowns.
Only a handful of cases (~10 out of >1000) slow down.

Pull Request resolved: #58749

Reviewed By: bdhirsh

Differential Revision: D31613199

Pulled By: ngimel

fbshipit-source-id: 883b58facad67ccd51dc9ab539368b4738d40398
langong347 pushed a commit that referenced this pull request Oct 25, 2021

Labels: merge-this-please, Merged, module: cuda, module: internals, open source
