Performance improvements for depthwise convolutions in FP16 #22302
Conversation
aten/src/ATen/native/Convolution.cpp
Outdated
int w = input.size(3); // same as h
int ch = input.size(1);
int bs = input.size(0);
int k = weight.size(2); // kernel size
you never use k in this function
You are right! That's some dead code and I'll remove it.
@pytorchbot rebase this please

@pytorchbot rebase this please

@pytorchbot rebase this please

Sorry, only maintainers are authorized to rebase other people's PRs. Feel free to try again on one of your PRs! (To learn more about this bot, see Bot commands.)

@pytorchbot retest this please

@ptrblck windows build failure looks real, it does not like "and" apparently.

@ngimel Thanks for the information!

@pytorchbot retest this please

@pytorchbot rebase this please
facebook-github-bot
left a comment
@izdeby has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
aten/src/ATen/native/Convolution.cpp
Outdated
auto ConvParams::use_cudnn_depthwise(
    const at::Tensor& input, const at::Tensor& weight) const -> bool {
#if AT_CUDNN_ENABLED()
  cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties();
You should not be calling getCurrentDeviceProperties() directly here; instead, add CUDAHooks::supportsDepthwiseConvolutionWithCuDNN to cuda/detail/CUDAHooks.cpp, like it's currently done for CUDAHooks::supportsDilatedConvolutionWithCuDNN(). That would also allow you to avoid the AT_CUDNN_ENABLED macro.
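
For readers following along, a minimal sketch of what such a hook could look like in cuda/detail/CUDAHooks.cpp, modeled on the existing supportsDilatedConvolutionWithCuDNN() pattern. The compute-capability and cuDNN-version thresholds are assumptions taken from the PR description (Volta/Turing, cudnn >= 7600); the merged implementation may differ.

```cpp
// Sketch only, not the merged code: capability query following the
// supportsDilatedConvolutionWithCuDNN() pattern in CUDAHooks.cpp.
bool CUDAHooks::supportsDepthwiseConvolutionWithCuDNN() const {
#if AT_CUDNN_ENABLED()
  cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties();
  // Assumed requirements: Volta or newer (sm_70+) and cuDNN >= 7.6.
  return prop->major >= 7 && CUDNN_VERSION >= 7600;
#else
  return false;
#endif
}
```

Keeping the AT_CUDNN_ENABLED check inside CUDAHooks.cpp is what lets the caller in Convolution.cpp drop the macro, as suggested.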
facebook-github-bot
left a comment
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Summary: This PR activates faster depthwise convolution kernels for Volta and Turing GPUs using cudnn >= 7600.

The script to benchmark the current PyTorch master branch and this PR branch can be found [here](https://gist.github.com/ptrblck/4590cf20721d8f43296c9903abd4a774) (50 warmup iterations, 1000 iterations for timing). I've used pytorch/pytorch#3265 to create a similar benchmark and added a few additional setups. Since the results are quite long, I've uploaded them in a spreadsheet [here](https://docs.google.com/spreadsheets/d/13ByXcqg7LQUr3DVG3XpLwnJ-CXg3GUZJ3puyTMw9n2I/edit?usp=sharing). Times are given in ms per iteration. We've benchmarked this PR on a DGX1 using V100 GPUs.

The current workload check in `check_cudnn_depthwise_workload` is quite long and can be moved to another file, if wanted.

CC ngimel (Thanks for the support while benchmarking it ;) )

Pull Request resolved: pytorch/pytorch#22302
Differential Revision: D16115057
Pulled By: ezyang
fbshipit-source-id: bad184658518e73b4d6b849d77e408f5a7a757de
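
As a rough illustration of the timing methodology described above (warmup iterations, then timed iterations reported in ms per iteration), here is a minimal libtorch sketch. The linked gist is the actual benchmark script; the layer shape, forward-only measurement, and wall-clock timing below are illustrative assumptions.

```cpp
#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
  torch::NoGradGuard no_grad;
  // Assumed example shape: batch 32, 256 channels, 56x56 input, 3x3 depthwise conv.
  const int64_t bs = 32, ch = 256, hw = 56, k = 3;
  auto conv = torch::nn::Conv2d(
      torch::nn::Conv2dOptions(ch, ch, k).padding(1).groups(ch).bias(false));
  conv->to(torch::kCUDA, torch::kHalf);  // FP16 weights on the GPU
  auto x = torch::randn({bs, ch, hw, hw},
                        torch::dtype(torch::kHalf).device(torch::kCUDA));

  for (int i = 0; i < 50; ++i) conv->forward(x);    // warmup iterations
  torch::cuda::synchronize();

  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < 1000; ++i) conv->forward(x);  // timed iterations
  torch::cuda::synchronize();
  auto t1 = std::chrono::steady_clock::now();

  std::cout << std::chrono::duration<double, std::milli>(t1 - t0).count() / 1000.0
            << " ms/iter\n";
}
```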

Hi, I'm very excited this is in :). I'm able to reproduce the individual depthwise convolution tests as presented in the spreadsheet - very impressive gains of up to 400%. I decided to test this with MobileNet V2; however, unfortunately I'm only seeing speedups of ~10%. My understanding is that depthwise convolution is the slowest link when training such lightweight networks, so this seems quite low to me? I've modified the test script to use MobileNet V2 here: I had a go at profiling with the autograd profiler, but that doesn't delineate between pointwise and depthwise convolution. Any pointers? Could it be that cuDNN isn't optimised for pointwise convolutions with large input/output channel ratios? (I'm assuming this is the right place to ask this, please correct me if not.)

@ptrblck thanks for this great PR and helpful benchmarking. I've created a couple of summary tables of the benchmarks that might be helpful. Here's speedup by kernel size and stride, by height/width: And here are the details of h/w by num channels, for just the stride-one and kernel-size-3 rows: Have you tried benchmarking 5x5 convs? They are used a lot in EfficientNet, so it would be great if they're fast... cc @ngimel

I just tried 5x5 convs and it appears they are not optimized for tensor cores - they ran at about the same speed for fp16 vs fp32.

Looks like cudnn is only enabled for 1x1 and 3x3 - https://github.com/pytorch/pytorch/pull/22302/files#diff-57ac615408468d3c7a461e505581bea3R316. Would enabling 5x5 change anything?

IIRC, cudnn only had fast implementations for 1x1 and 3x3, so just enabling it for 5x5 is unlikely to dramatically speed things up.

Thanks @ngimel. Is there any plan to add a 5x5 implementation? If not, could I twist your arm to create such a plan... ;)

Not mine, I'm not with nvidia anymore :-)

Oh yes, so I see! Welcome to Facebook then :)

Hi @ptrblck and @ngimel, just to clarify a point that is causing some confusion: although this patch uses cuDNN kernels for depthwise convolution on Volta, do those kernels actually use tensor cores? Depthwise convolution is basically planar convolution nested inside of diagonal matrix multiplication. I do not see how that could be made faster with tensor cores.

@andravin you are right, those kernels don't use tensor cores.
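
A small libtorch sketch of the structural point above, with assumed toy shapes: a depthwise (groups == channels) convolution decomposes into independent single-channel planar convolutions, one per channel, with no large matrix multiply involved.

```cpp
#include <torch/torch.h>
#include <iostream>
#include <vector>

int main() {
  torch::NoGradGuard no_grad;
  const int64_t C = 4, H = 8, W = 8, K = 3;  // assumed toy sizes
  auto x = torch::randn({1, C, H, W});
  auto w = torch::randn({C, 1, K, K});

  // Depthwise convolution: groups == number of channels.
  auto y_depthwise = torch::conv2d(x, w, /*bias=*/{}, /*stride=*/1,
                                   /*padding=*/1, /*dilation=*/1, /*groups=*/C);

  // The same result built channel-by-channel from ordinary planar convs.
  std::vector<torch::Tensor> per_channel;
  for (int64_t c = 0; c < C; ++c) {
    per_channel.push_back(torch::conv2d(x.narrow(1, c, 1), w.narrow(0, c, 1),
                                        /*bias=*/{}, /*stride=*/1, /*padding=*/1));
  }
  auto y_planar = torch::cat(per_channel, /*dim=*/1);

  std::cout << std::boolalpha << torch::allclose(y_depthwise, y_planar) << "\n";  // true
}
```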

@ngimel do they use …? Otherwise I am stumped why these kernels are Volta only. Also, P100 had …

Summary: Follow up of #38044. Thanks ptrblck, mcarilli for the help on discussing the changes!

Could fix #37725 by skipping the depthwise-workload check introduced in #22302. This PR also relaxed dilated convolution for channels-last.

The testing script is https://gist.github.com/xwang233/82a707f69bb710cb612349280a2c5f41. About 387k conv arguments were tested and no cudnn exception was thrown.

cc ngimel VitalyFedyunin ptrblck mcarilli

Pull Request resolved: #38904
Differential Revision: D22155797
Pulled By: VitalyFedyunin
fbshipit-source-id: 81b5736cec67ea263029121521c6acafd9dddba6

Summary: There are multiple improvements to depthwise convolution speed in cudnn between 7.6 and 8.2, since #22302. This PR aims to harvest all the new improvements by enabling more cudnn kernels. The workload-checking logic can also be simplified now. To keep the change simple, I kept things before cudnn 8.2 unchanged.

Similar to #22302, I used a script [here](https://gist.github.com/FDecaYed/e8ba98a95cd33697df2ace86fdb44897) to benchmark. Both runs are using cudnn 8.2. One enhancement I made to the script is switching to event-based timing. With warmup kernels to fill the launch queue ahead, this should give us accurate kernel timing even in CPU launch-bound cases. Here are the A100 and V100 results sorted by speedup: [Book1.xlsx](https://github.com/pytorch/pytorch/files/6530371/Book1.xlsx)

Result highlights:
- Newly turned-on 5x5 cudnn kernels show up to 6x speedup.
- Close to half of the tested sizes show >10% speedup.
- Fixed some corner cases that previously caused 15-20x slowdowns.
- Only slows down a handful of cases (~10 out of >1000).

Pull Request resolved: #58749
Reviewed By: bdhirsh
Differential Revision: D31613199
Pulled By: ngimel
fbshipit-source-id: 883b58facad67ccd51dc9ab539368b4738d40398


This PR activates faster depthwise convolution kernels for Volta and Turing GPUs using cudnn >= 7600.
The script to benchmark the current PyTorch master branch and this PR branch can be found here.
(50 warmup iterations, 1000 iterations for timing)
I've used #3265 to create a similar benchmark and added a few additional setups.
Since the results are quite long, I've uploaded them in a spreadsheet here.
Times are given in ms per iteration.
We've benchmarked this PR on a DGX1 using V100 GPUs.
The current workload check in `check_cudnn_depthwise_workload` is quite long and can be moved to another file, if wanted.

CC @ngimel (Thanks for the support while benchmarking it ;) )
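
To give a feel for what that workload check gates on, here is a highly simplified, self-contained sketch. Everything in it (the function name, parameters, and thresholds) is a placeholder for illustration; the real check_cudnn_depthwise_workload is a long, benchmark-derived table of batch-size/channel/resolution/stride combinations.

```cpp
#include <cstdint>

// Illustrative sketch only: the merged check in aten/src/ATen/native/Convolution.cpp
// enumerates many specific size combinations; the conditions below are placeholders.
bool cudnn_depthwise_sketch(int64_t batch, int64_t channels, int64_t width,
                            int64_t kernel, int64_t stride,
                            bool fp16, bool volta_or_newer, long cudnn_version) {
  const bool hw_ok    = fp16 && volta_or_newer && cudnn_version >= 7600;
  const bool shape_ok = (kernel == 1 || kernel == 3) && (stride == 1 || stride == 2);
  // Stand-in for the per-size table: only route large enough workloads to cuDNN.
  const bool workload_ok = batch >= 8 && channels >= 32 && width >= 7;
  return hw_ok && shape_ok && workload_ok;
}
```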