enable better depthwise conv perf on cudnn 8.2+ #58749
Conversation
Updated result spreadsheet with V100 results. @ptrblck
Force-pushed 414c0ac to f4b5067
@ptrblck Let's try to merge this now that cuDNN 8.2 is more widely used.
ngimel left a comment
Looks good, thanks!
But generally, since networks should already be in channels-last, this shouldn't affect much?
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Yes, this is for NCHW, since channels-last already calls into cuDNN. Whoever is not running channels-last should see some benefit.
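For context on the NCHW vs. channels-last distinction discussed above, here is a minimal PyTorch sketch of checking and converting a tensor's memory format (shapes are arbitrary illustrative values):

```python
import torch

# NCHW is PyTorch's default (contiguous) layout; channels-last (NHWC)
# is the layout whose convolutions already dispatch to cuDNN.
x = torch.randn(2, 8, 16, 16)                  # NCHW by default
xcl = x.to(memory_format=torch.channels_last)  # same shape, NHWC strides

print(x.is_contiguous())                                     # True
print(xcl.is_contiguous(memory_format=torch.channels_last))  # True
```

Note that `.to(memory_format=...)` changes only the strides, not the logical shape, which is why a model can switch layouts without touching its code.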
fixed condition for 1D conv check
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: There have been multiple improvements to depthwise convolution speed in cuDNN between 7.6 and 8.2, since #22302. This PR aims to harvest those improvements by enabling more cuDNN kernels; the workload-checking logic can also be simplified now. To keep the change simple, behavior before cuDNN 8.2 is unchanged.

Similar to #22302, I used a script [here](https://gist.github.com/FDecaYed/e8ba98a95cd33697df2ace86fdb44897) to benchmark. Both runs use cuDNN 8.2. One enhancement to the script is a switch to event-based timing: with warmup kernels filling the launch queue ahead of the timed region, this gives accurate kernel timings even in CPU-launch-bound cases.

Here are the A100 and V100 results, sorted by speedup: [Book1.xlsx](https://github.com/pytorch/pytorch/files/6530371/Book1.xlsx)

Result highlights:
- The newly enabled 5x5 cuDNN kernels show up to a 6x speedup.
- Close to half of the tested sizes show a >10% speedup.
- Fixed some corner cases that previously caused 15-20x slowdowns.
- Only a handful of cases (~10 out of >1000) slow down.

Pull Request resolved: #58749
Reviewed By: bdhirsh
Differential Revision: D31613199
Pulled By: ngimel
fbshipit-source-id: 883b58facad67ccd51dc9ab539368b4738d40398
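For readers unfamiliar with the term, a depthwise convolution in PyTorch is an ordinary `Conv2d` with `groups` equal to the channel count. The sketch below shows the definition only (sizes are illustrative, and this CPU example does not exercise the cuDNN dispatch the PR changes); the 5x5 kernel size is one of the shapes highlighted above:

```python
import torch

C = 8  # illustrative channel count
# groups=C gives each input channel its own filter: a depthwise convolution.
depthwise = torch.nn.Conv2d(C, C, kernel_size=5, padding=2, groups=C)

x = torch.randn(1, C, 32, 32)  # NCHW input
y = depthwise(x)
print(y.shape)  # torch.Size([1, 8, 32, 32])
```

With `padding=2` the 5x5 kernel preserves the spatial size, so input and output shapes match.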
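The event-based timing described in the summary can be sketched roughly as follows. `bench` and its parameters are hypothetical names, not the actual benchmark script; the sketch falls back to wall-clock timing when no GPU is present so it stays runnable on CPU, though only the CUDA-event path gives the launch-overhead-free numbers the PR relies on:

```python
import time
import torch

def bench(fn, x, warmup=10, iters=50):
    # Warmup calls fill the CUDA launch queue ahead of the timed region,
    # so the measurement reflects kernel time rather than CPU launch overhead.
    for _ in range(warmup):
        fn(x)
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn(x)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters  # ms per call
    # CPU fallback (not what the PR measures; just keeps the sketch runnable)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - t0) * 1000 / iters

conv = torch.nn.Conv2d(4, 4, 3, padding=1, groups=4)
ms_per_call = bench(conv, torch.randn(1, 4, 16, 16))
```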