Conversation

@FDecaYed (Contributor) commented May 21, 2021

There have been multiple improvements to depthwise convolution speed in cuDNN between 7.6 and 8.2, since #22302.
This PR aims to harvest all of the new improvements by enabling more cuDNN kernels. The workload-checking logic can also be simplified now.
To keep the change simple, I left behavior before cuDNN 8.2 unchanged.
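
For reference, a depthwise convolution in PyTorch is a grouped convolution with `groups == in_channels`; a minimal sketch of the kind of workload this PR routes to cuDNN (shapes are illustrative, not taken from the benchmark; 5x5 is one of the newly enabled kernel sizes):

```python
import torch
import torch.nn as nn

# Depthwise convolution: groups == in_channels, i.e. one filter per channel.
x = torch.randn(32, 64, 56, 56, device="cuda")  # NCHW input (shapes illustrative)
conv = nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=5,   # 5x5 is one of the newly enabled cuDNN cases
    padding=2,
    groups=64,       # groups == in_channels makes the conv depthwise
).cuda()
y = conv(x)  # dispatched to cuDNN or the native depthwise kernel
```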

Similar to #22302, I used a script (https://gist.github.com/FDecaYed/e8ba98a95cd33697df2ace86fdb44897) to benchmark. Both runs use cuDNN 8.2.
One enhancement I made to the script is switching to event-based timing. With warmup kernels filling the launch queue ahead of time, this should give accurate kernel timing even in CPU-launch-bound cases.
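
For illustration, event-based timing with a warmup that pre-fills the launch queue looks roughly like this (a minimal sketch, not the exact benchmark script; the iteration counts are arbitrary):

```python
import torch

def time_kernel(fn, iters=100, warmup=10):
    # Warmup also fills the launch queue ahead of the timed region, so the
    # measurement stays accurate even when the CPU is the launch bottleneck.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    return start.elapsed_time(end) / iters  # average milliseconds per call
```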

Here are the A100 and V100 results, sorted by speedup:
Book1.xlsx (https://github.com/pytorch/pytorch/files/6530371/Book1.xlsx)

Result highlights:

  • The newly enabled 5x5 cuDNN kernel shows up to 6x speedup.
  • Close to half of the tested sizes show >10% speedup.
  • Some corner cases that previously caused 15-20x slowdowns are fixed.
  • Only a handful of cases (~10 out of >1000) slow down.

@facebook-github-bot (Contributor) commented May 21, 2021

💊 CI failures summary and remediations

As of commit f4b5067 (more details on the Dr. CI page):



4 failures not recognized by patterns:

Job                                               | Step
CircleCI pytorch_ios_12_5_1_x86_64_coreml_build   | Update Homebrew
CircleCI pytorch_macos_10_13_py3_test             | Update Homebrew
CircleCI pytorch_ios_12_5_1_x86_64_full_jit_build | Update Homebrew
CircleCI pytorch_macos_10_15_py3_build            | Update Homebrew

❄️ 2 failures tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_ios_12_5_1_x86_64_build (1/2)

Step: "Update Homebrew" ❄️

fatal: Could not read from remote repository.
Receiving objects: 100% (119/119), 28.33 KiB | 5.67 MiB/s, done.
Resolving deltas: 100% (104/104), completed with 93 local objects.
From ssh://github.com/Homebrew/homebrew-cask-versions
 + fc8477195...1f30da4a0 master     -> origin/master  (forced update)
+ git reset --hard origin/master
HEAD is now at 1f30da4a0 Update dotnet-preview and dotnet-sdk-preview to 6.0.0-RC2 (#12196)
+ for path in '$(find /usr/local/Homebrew -type d -name .git)'
+ cd /usr/local/Homebrew/Library/Taps/homebrew/homebrew-core/.git/..
+ git fetch --depth=1 origin
ssh: connect to host github.com port 22: Operation timed out

fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.


Exited with code exit status 128

See CircleCI build pytorch_macos_10_13_py3_lite_interpreter_build_test (2/2)

Step: "Update Homebrew" ❄️

fatal: Could not read from remote repository.
remote: Total 5 (delta 3), reused 0 (delta 0), pack-reused 0        
Unpacking objects: 100% (5/5), 1.18 KiB | 303.00 KiB/s, done.
From ssh://github.com/AdoptOpenJDK/homebrew-openjdk
 + 3e6ed3c...f6e8c97 master     -> origin/master  (forced update)
+ git reset --hard origin/master
HEAD is now at f6e8c97 Add deprecation notice to direct users to temurin (#538)
+ for path in '$(find /usr/local/Homebrew -type d -name .git)'
+ cd /usr/local/Homebrew/.git/..
+ git fetch --depth=1 origin
ssh: connect to host github.com port 22: Operation timed out

fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.


Exited with code exit status 128


This comment was automatically generated by Dr. CI.

@H-Huang requested a review from jbschlosser May 22, 2021 04:57
@H-Huang added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 22, 2021
@FDecaYed (Contributor, Author) commented:

Updated the result spreadsheet with V100 results. @ptrblck

@jbschlosser requested review from ngimel and removed the request for jbschlosser June 10, 2021 14:24
@FDecaYed force-pushed the deyuf/update_depthwise_cudnn branch from 414c0ac to f4b5067 October 13, 2021 11:48
@pytorch-probot (bot) commented Oct 13, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/FDecaYed/pytorch/blob/d4916cbc8eb92fb4d14a81ece1c5750bdaf397ec/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla ✅ triggered
linux-vulkan-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers ✅ triggered
linux-xenial-py3.6-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx ✅ triggered
linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/win ✅ triggered
Skipped Workflows
libtorch-linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
puretorch-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@FDecaYed (Contributor, Author) commented:

@ptrblck Let's try to merge this now that cuDNN 8.2 is more widely used.
I've rebased the PR on top of master and made the following two changes:

  • restructured the code so the change is much clearer; the code path for cuDNN < 8.2 remains untouched
  • added support for 1D conv; turning on cuDNN for stride == 1 provides some benefit (see the sketch below)
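
For illustration, the newly covered 1D case is a depthwise `Conv1d` with `stride == 1` (a minimal sketch; shapes are illustrative, not from the benchmark):

```python
import torch
import torch.nn as nn

# 1D depthwise convolution with stride 1: now eligible for cuDNN.
x = torch.randn(32, 64, 1024, device="cuda")  # (N, C, L) layout
conv = nn.Conv1d(64, 64, kernel_size=3, stride=1, padding=1, groups=64).cuda()
y = conv(x)  # output shape: torch.Size([32, 64, 1024])
```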

@ngimel (Collaborator) left a comment

Looks good, thanks!
But generally, since networks should be in channels-last already, that shouldn't affect much?

@facebook-github-bot (Contributor) commented:

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@FDecaYed (Contributor, Author) commented:

> Looks good, thanks! But generally, since networks should be in channels-last already, that shouldn't affect much?

Yes, this is for NCHW, since channels-last already calls into cuDNN. Anyone not running channels-last should see some benefit.
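
For context, a minimal sketch of the two layouts (channels-last already took the cuDNN path before this PR; default-contiguous NCHW is the case this PR speeds up):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 64, 56, 56, device="cuda")
conv = nn.Conv2d(64, 64, kernel_size=5, padding=2, groups=64).cuda()

y_nchw = conv(x)  # default NCHW layout: improved by this PR

# Channels-last (NHWC) layout: already dispatched to cuDNN before this PR.
x_cl = x.to(memory_format=torch.channels_last)
conv.to(memory_format=torch.channels_last)
y_nhwc = conv(x_cl)
```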

fixed condition for 1D conv check
@facebook-github-bot (Contributor) commented:

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Oct 23, 2021
Summary:
There have been multiple improvements to depthwise convolution speed in cuDNN between 7.6 and 8.2, since #22302.
This PR aims to harvest all of the new improvements by enabling more cuDNN kernels. The workload-checking logic can also be simplified now.
To keep the change simple, I left behavior before cuDNN 8.2 unchanged.

Similar to #22302, I used a script [here](https://gist.github.com/FDecaYed/e8ba98a95cd33697df2ace86fdb44897) to benchmark. Both runs use cuDNN 8.2.
One enhancement I made to the script is switching to event-based timing. With warmup kernels filling the launch queue ahead of time, this should give accurate kernel timing even in CPU-launch-bound cases.

Here are the A100 and V100 results, sorted by speedup:
[Book1.xlsx](https://github.com/pytorch/pytorch/files/6530371/Book1.xlsx)

Result highlights:

  • The newly enabled 5x5 cuDNN kernel shows up to 6x speedup.
  • Close to half of the tested sizes show >10% speedup.
  • Some corner cases that previously caused 15-20x slowdowns are fixed.
  • Only a handful of cases (~10 out of >1000) slow down.

Pull Request resolved: #58749

Reviewed By: bdhirsh

Differential Revision: D31613199

Pulled By: ngimel

fbshipit-source-id: 883b58facad67ccd51dc9ab539368b4738d40398
langong347 pushed a commit that referenced this pull request Oct 25, 2021

Labels: cla signed, open source, triaged