Conversation

@FDecaYed (Contributor) commented May 21, 2021

There have been multiple improvements to depthwise convolution speed in cuDNN between 7.6 and 8.2, since #22302.
This PR aims to harvest all of the new improvements by enabling more cuDNN kernels. The workload-checking logic can also be simplified now.
To keep the change simple, I left behavior before cuDNN 8.2 unchanged.
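
For reference, a depthwise convolution in PyTorch is a grouped convolution with `groups == in_channels`; a minimal sketch of the kind of workload this PR routes to cuDNN (shapes are illustrative, not taken from the benchmark; 5x5 is one of the newly enabled kernel sizes):

```python
import torch
import torch.nn as nn

# Depthwise convolution: groups == in_channels, i.e. one filter per channel.
x = torch.randn(32, 64, 56, 56, device="cuda")  # NCHW input (shapes illustrative)
conv = nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=5,   # 5x5 is one of the newly enabled cuDNN cases
    padding=2,
    groups=64,       # groups == in_channels makes the conv depthwise
).cuda()
y = conv(x)  # dispatched to cuDNN or the native depthwise kernel
```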

Similar to #22302, I used a script (https://gist.github.com/FDecaYed/e8ba98a95cd33697df2ace86fdb44897) to benchmark. Both runs use cuDNN 8.2.
One enhancement I made to the script is switching to event-based timing. With warmup kernels filling the launch queue ahead of time, this should give accurate kernel timing even in CPU-launch-bound cases.
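
For illustration, event-based timing with a warmup that pre-fills the launch queue looks roughly like this (a minimal sketch, not the exact benchmark script; the iteration counts are arbitrary):

```python
import torch

def time_kernel(fn, iters=100, warmup=10):
    # Warmup also fills the launch queue ahead of the timed region, so the
    # measurement stays accurate even when the CPU is the launch bottleneck.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    return start.elapsed_time(end) / iters  # average milliseconds per call
```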

Here are the A100 and V100 results, sorted by speedup:
Book1.xlsx (https://github.com/pytorch/pytorch/files/6530371/Book1.xlsx)

Result highlights:

  • The newly enabled 5x5 cuDNN kernel shows up to 6x speedup.
  • Close to half of the tested sizes show >10% speedup.
  • Some corner cases that previously caused 15-20x slowdowns are fixed.
  • Only a handful of cases (~10 out of >1000) slow down.

@facebook-github-bot (Contributor) commented May 21, 2021

💊 CI failures summary and remediations

As of commit f4b5067 (more details on the Dr. CI page):



4 failures not recognized by patterns:

Job                                               | Step
CircleCI pytorch_ios_12_5_1_x86_64_coreml_build   | Update Homebrew
CircleCI pytorch_macos_10_13_py3_test             | Update Homebrew
CircleCI pytorch_ios_12_5_1_x86_64_full_jit_build | Update Homebrew
CircleCI pytorch_macos_10_15_py3_build            | Update Homebrew

❄️ 2 failures tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_ios_12_5_1_x86_64_build (1/2)

Step: "Update Homebrew" ❄️

fatal: Could not read from remote repository.
Receiving objects: 100% (119/119), 28.33 KiB | 5.67 MiB/s, done.
Resolving deltas: 100% (104/104), completed with 93 local objects.
From ssh://github.com/Homebrew/homebrew-cask-versions
 + fc8477195...1f30da4a0 master     -> origin/master  (forced update)
+ git reset --hard origin/master
HEAD is now at 1f30da4a0 Update dotnet-preview and dotnet-sdk-preview to 6.0.0-RC2 (#12196)
+ for path in '$(find /usr/local/Homebrew -type d -name .git)'
+ cd /usr/local/Homebrew/Library/Taps/homebrew/homebrew-core/.git/..
+ git fetch --depth=1 origin
ssh: connect to host github.com port 22: Operation timed out

fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.


Exited with code exit status 128

See CircleCI build pytorch_macos_10_13_py3_lite_interpreter_build_test (2/2)

Step: "Update Homebrew" ❄️

fatal: Could not read from remote repository.
remote: Total 5 (delta 3), reused 0 (delta 0), pack-reused 0        
Unpacking objects: 100% (5/5), 1.18 KiB | 303.00 KiB/s, done.
From ssh://github.com/AdoptOpenJDK/homebrew-openjdk
 + 3e6ed3c...f6e8c97 master     -> origin/master  (forced update)
+ git reset --hard origin/master
HEAD is now at f6e8c97 Add deprecation notice to direct users to temurin (#538)
+ for path in '$(find /usr/local/Homebrew -type d -name .git)'
+ cd /usr/local/Homebrew/.git/..
+ git fetch --depth=1 origin
ssh: connect to host github.com port 22: Operation timed out

fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.


Exited with code exit status 128


This comment was automatically generated by Dr. CI.

@H-Huang requested a review from jbschlosser May 22, 2021 04:57
@H-Huang added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 22, 2021
@FDecaYed (Contributor, Author) commented:

Updated the result spreadsheet with V100 results. @ptrblck

@jbschlosser requested review from ngimel and removed the request for jbschlosser June 10, 2021 14:24
@FDecaYed force-pushed the deyuf/update_depthwise_cudnn branch from 414c0ac to f4b5067 October 13, 2021 11:48
@pytorch-probot (bot) commented Oct 13, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/FDecaYed/pytorch/blob/d4916cbc8eb92fb4d14a81ece1c5750bdaf397ec/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla ✅ triggered
linux-vulkan-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers ✅ triggered
linux-xenial-py3.6-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx ✅ triggered
linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/win ✅ triggered
Skipped Workflows
libtorch-linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
puretorch-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@FDecaYed (Contributor, Author) commented:

@ptrblck Let's try to merge this now that cuDNN 8.2 is more widely used.
I've rebased the PR on top of master and made the following two changes:

  • restructured the code so the change is much clearer; the code path for cuDNN < 8.2 remains untouched
  • added support for 1D conv; turning on cuDNN for stride == 1 provides some benefit (see the sketch below)
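
For illustration, the newly covered 1D case is a depthwise `Conv1d` with `stride == 1` (a minimal sketch; shapes are illustrative, not from the benchmark):

```python
import torch
import torch.nn as nn

# 1D depthwise convolution with stride 1: now eligible for cuDNN.
x = torch.randn(32, 64, 1024, device="cuda")  # (N, C, L) layout
conv = nn.Conv1d(64, 64, kernel_size=3, stride=1, padding=1, groups=64).cuda()
y = conv(x)  # output shape: torch.Size([32, 64, 1024])
```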

@ngimel (Collaborator) left a comment

Looks good, thanks!
But generally, since networks should be in channels-last already, that shouldn't affect much?

@facebook-github-bot (Contributor) commented:

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@FDecaYed (Contributor, Author) commented:

> Looks good, thanks! But generally, since networks should be in channels-last already, that shouldn't affect much?

Yes, this is for NCHW, since channels-last already calls into cuDNN. Anyone not running channels-last should see some benefit.
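
For context, a minimal sketch of the two layouts (channels-last already took the cuDNN path before this PR; default-contiguous NCHW is the case this PR speeds up):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 64, 56, 56, device="cuda")
conv = nn.Conv2d(64, 64, kernel_size=5, padding=2, groups=64).cuda()

y_nchw = conv(x)  # default NCHW layout: improved by this PR

# Channels-last (NHWC) layout: already dispatched to cuDNN before this PR.
x_cl = x.to(memory_format=torch.channels_last)
conv.to(memory_format=torch.channels_last)
y_nhwc = conv(x_cl)
```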

fixed condition for 1D conv check
@facebook-github-bot (Contributor) commented:

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Oct 23, 2021
Summary:
There have been multiple improvements to depthwise convolution speed in cuDNN between 7.6 and 8.2, since #22302.
This PR aims to harvest all of the new improvements by enabling more cuDNN kernels. The workload-checking logic can also be simplified now.
To keep the change simple, I left behavior before cuDNN 8.2 unchanged.

Similar to #22302, I used a script [here](https://gist.github.com/FDecaYed/e8ba98a95cd33697df2ace86fdb44897) to benchmark. Both runs use cuDNN 8.2.
One enhancement I made to the script is switching to event-based timing. With warmup kernels filling the launch queue ahead of time, this should give accurate kernel timing even in CPU-launch-bound cases.

Here are the A100 and V100 results, sorted by speedup:
[Book1.xlsx](https://github.com/pytorch/pytorch/files/6530371/Book1.xlsx)

Result highlights:

  • The newly enabled 5x5 cuDNN kernel shows up to 6x speedup.
  • Close to half of the tested sizes show >10% speedup.
  • Some corner cases that previously caused 15-20x slowdowns are fixed.
  • Only a handful of cases (~10 out of >1000) slow down.

Pull Request resolved: #58749

Reviewed By: bdhirsh

Differential Revision: D31613199

Pulled By: ngimel

fbshipit-source-id: 883b58facad67ccd51dc9ab539368b4738d40398
langong347 pushed a commit that referenced this pull request Oct 25, 2021

Labels: cla signed, open source, triaged