Conversation

@XiaobingSuper
Collaborator

@XiaobingSuper XiaobingSuper commented May 16, 2019

mkldnn backward ops list:

Enable mkldnn backward ops, which can improve training performance by about 2x for the resnext101 model.
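As a rough illustration of the goal, here is a hypothetical usage sketch (not code from this PR; it assumes an MKL-DNN-enabled build in which the backward ops from this series are available), where the backward pass dispatches to mkldnn kernels instead of falling back to the native CPU path:

```python
import torch

# Hypothetical sketch: forward through an mkldnn op, then backward.
x = torch.randn(8, 32, requires_grad=True)
y = torch.relu(x.to_mkldnn())    # forward dispatches to the mkldnn relu kernel
y.to_dense().sum().backward()    # backward can now also run on the mkldnn path
print(x.grad.shape)              # torch.Size([8, 32])
```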

@mingfeima
Collaborator

Hi @bddppq, the backward integration is done; please review the code.
Feel free to invite more reviewers.

@XiaobingSuper XiaobingSuper mentioned this pull request May 16, 2019 (6 tasks)
@XiaobingSuper
Collaborator Author

cc @uyongw, @Jianhui-Li, @jgong5.

@li-roy li-roy added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 16, 2019
@XiaobingSuper XiaobingSuper force-pushed the mkldnn_conv_backward branch from a6870f6 to 35ef05b on June 13, 2019 04:46
facebook-github-bot pushed a commit that referenced this pull request Jun 13, 2019
Summary:
### mkldnn backward ops list:
 - [ ] \(#20567) Add aten mkldnn conv2d backward operator 💛
 - [ ] \(#20570) Add aten mkldnn backward ops: relu, linear and reshape 💛
 - [ ] \(#20571) Add aten mkldnn backward ops: max_pool2d, avg_pool2d and adaptive_avg_pool2d 💛
 - [ ] \(#20572) Add aten mkldnn batchnorm backward operator 💛
 - [ ] \(#20573) Add aten mkldnn zero_ operator 💛
 - [ ] \(#20575) Add mkldnn mul operator 💛
Pull Request resolved: #20575

Differential Revision: D15799529

Pulled By: bddppq

fbshipit-source-id: 4887d8ef1a0e316ad9db199b657d9481fc13e486
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 13, 2019
@gottbrath gottbrath requested review from bddppq and dzhulgakov June 13, 2019 17:09
facebook-github-bot pushed a commit that referenced this pull request Jun 14, 2019
Summary:
### mkldnn backward ops list:
 - [ ] \(#20567) Add aten mkldnn conv2d backward operator 💛
 - [ ] \(#20570) Add aten mkldnn backward ops: relu, linear and reshape 💛
 - [ ] \(#20571) Add aten mkldnn backward ops: max_pool2d, avg_pool2d and adaptive_avg_pool2d 💛
 - [ ] \(#20572) Add aten mkldnn batchnorm backward operator 💛
 - [ ] \(#20573) Add aten mkldnn zero_ operator 💛
 - [ ] \(#20575) Add mkldnn mul operator 💚
Pull Request resolved: #20573

Differential Revision: D15820477

Pulled By: bddppq

fbshipit-source-id: 35d95f5b4e013c8db1911f52148550a2e40a2e68
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 14, 2019
@gottbrath gottbrath requested a review from gchanan June 25, 2019 16:22
@dzhulgakov
Collaborator

In general looks OK. Two questions:

  • CI is failing.
  • Did you benchmark the potential perf change from introducing ideep in the backward path? It should be OK, but best to double-check (see the timing sketch below).
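A minimal timing sketch of the kind of forward/backward breakdown being asked about (assumed setup in Python; the actual script behind the numbers posted later in this thread is not shown here):

```python
import time

import torch
import torchvision

# Hypothetical micro-benchmark: time forward and backward separately, the
# same breakdown reported in the logs below. Model and batch size here are
# assumptions, not taken from the thread's actual benchmark script.
model = torchvision.models.resnet50()
x = torch.randn(128, 3, 224, 224)

model(x)                                  # warm-up pass
t0 = time.time()
out = model(x)
fwd_ms = (time.time() - t0) * 1000        # forward time

loss = out.sum()
t0 = time.time()
loss.backward()
bwd_ms = (time.time() - t0) * 1000        # backward time
print(f"forward: {fwd_ms:.2f} ms, backward: {bwd_ms:.2f} ms")
```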

@mingfeima
Collaborator

@dzhulgakov so what is the proposed system config for training? Dual socket or single socket?
@XiaobingSuper, please post benchmark numbers here.

However, we have a known issue here: the memory buffer is not cached (output tensors need to be allocated every time), which is going to add significant overhead in the training scenario.

A customized memory allocator is a solution, but it seems to go against the current design. Suggestions?

@jgong5
Collaborator

jgong5 commented Jul 16, 2019

> However, we have a known issue here: the memory buffer is not cached (output tensors need to be allocated every time), which is going to add significant overhead in the training scenario.
>
> A customized memory allocator is a solution, but it seems to go against the current design. Suggestions?

@dzhulgakov The performance issue @mingfeima pointed out is caused by clear_page overhead on large buffer allocations with malloc. This happens when PyTorch allocates large activation buffers, which is typical during training since the batch size is usually large. As you know, whether clear_page is triggered is controlled by M_MMAP_THRESHOLD (http://man7.org/linux/man-pages/man3/mallopt.3.html), which defaults to 128 KB; on 64-bit systems the upper bound is 32 MB, which is not a big number w.r.t. typical training workloads. I notice that PyTorch has a caching allocator in the GPU path, and the CPU path could probably use a similar idea to avoid calling malloc again and again for output tensors. Currently, we have a quick implementation of a caching allocator inside ideep, and @XiaobingSuper will share the training performance numbers with and without the caching allocator FYI.
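For illustration, here is a minimal caching-allocator sketch (illustrative Python only; it is not the ideep implementation mentioned above). Freed buffers go into per-size free lists and are reused, so large activation buffers avoid paying the mmap/clear_page cost on every iteration:

```python
import collections

import torch

class CachingAllocator:
    """Illustrative sketch: keep freed buffers in per-(shape, dtype) free
    lists and reuse them, instead of returning memory to malloc and paying
    the clear_page cost again on the next large allocation."""

    def __init__(self):
        self._free = collections.defaultdict(list)

    def alloc(self, shape, dtype=torch.float32):
        key = (tuple(shape), dtype)
        if self._free[key]:
            return self._free[key].pop()         # cache hit: reuse warm pages
        return torch.empty(shape, dtype=dtype)   # cache miss: real allocation

    def release(self, t):
        self._free[(tuple(t.shape), t.dtype)].append(t)

allocator = CachingAllocator()
buf = allocator.alloc((128, 64, 56, 56))   # e.g. a large activation buffer
allocator.release(buf)
assert allocator.alloc((128, 64, 56, 56)) is buf   # reused, not re-allocated
```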

@XiaobingSuper XiaobingSuper force-pushed the mkldnn_conv_backward branch from 35ef05b to d19eaab on July 16, 2019 02:48
@XiaobingSuper
Collaborator Author

XiaobingSuper commented Jul 16, 2019

Running this benchmark on an SKX-6148 with 2 sockets, we get about a 2x performance improvement using the mkldnn path compared to the native path:

| models | native path (img/s) | mkldnn path (img/s) | speedup |
| --- | --- | --- | --- |
| resnet50 | 28.27 | 67.81 | 239.87% |
| resnext101 | 8.73 | 19.71 | 225.77% |

The detailed logs follow:

  • Native path:
### using OMP_NUM_THREADS=40
### using KMP_AFFINITY=granularity=fine,compact,1,0
### using KMP_BLOCKTIME=1

Running on device:  Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Running on torch: 1.2.0a0+69e3229
Running on torchvision: 0.3.0a0+8837e0e

ModelType: resnet50, Kernels: nn Input shape: 128x3x224x224
nn                              :forward:    1803.12 (ms)      70.99 (imgs/s)
nn                             :backward:    2725.13 (ms)
nn                               :update:       4.72 (ms)
nn                                :total:    4528.25 (ms)      28.27 (imgs/s)
ModelType: resnext101_32x8d, Kernels: nn Input shape: 128x3x224x224
nn                              :forward:    6042.85 (ms)      21.18 (imgs/s)
nn                             :backward:    8624.28 (ms)
nn                               :update:      13.07 (ms)
nn                                :total:   14667.13 (ms)       8.73 (imgs/s)
  • MKLDNN path:
### using OMP_NUM_THREADS=40
### using KMP_AFFINITY=granularity=fine,compact,1,0
### using KMP_BLOCKTIME=1

Running on device:  Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Running on torch: 1.2.0a0+69e3229
Running on torchvision: 0.3.0a0+8837e0e

ModelType: resnet50, Kernels: nn Input shape: 128x3x224x224
nn                              :forward:     659.40 (ms)     194.12 (imgs/s)
nn                             :backward:    1228.16 (ms)
nn                               :update:       4.69 (ms)
nn                                :total:    1887.55 (ms)      67.81 (imgs/s)
ModelType: resnext101_32x8d, Kernels: nn Input shape: 128x3x224x224
nn                              :forward:    2259.24 (ms)      56.66 (imgs/s)
nn                             :backward:    4234.59 (ms)
nn                               :update:      13.00 (ms)
nn                                :total:    6493.83 (ms)      19.71 (imgs/s)

Next step, I will share the performance using caching allocator. Thanks!

@XiaobingSuper
Collaborator Author

Adding the performance numbers using the ideep caching allocator on an SKX-6148 with 2 sockets: there is at least a 1.36x performance improvement for inference with large batch size, and at least a 1.32x improvement for training.

1. For inference using MKLDNN with large batch size:

   | models | MKLDNN default allocator (img/s) | MKLDNN caching allocator (img/s) | speedup |
   | --- | --- | --- | --- |
   | resnet50 | 196.91 | 289.69 | 147.12% |
   | resnext101 | 58.23 | 79.77 | 136.99% |

2. For training using MKLDNN:

   | models | MKLDNN default allocator (img/s) | MKLDNN caching allocator (img/s) | speedup |
   | --- | --- | --- | --- |
   | resnet50 | 67.81 | 96.05 | 141.65% |
   | resnext101 | 19.71 | 26.2 | 132.93% |

@XiaobingSuper XiaobingSuper force-pushed the mkldnn_conv_backward branch from d19eaab to 1db67f7 on July 18, 2019 05:08
@dzhulgakov dzhulgakov requested a review from zheng-xq July 22, 2019 00:08
@dzhulgakov
Collaborator

Adding a CPU caching allocator is something we also discussed. The problem is mostly alleviated by using a better malloc, e.g. jemalloc (http://jemalloc.net/jemalloc.3.html). Unfortunately, I don't think there's a safe way to package it with prebuilt pytorch binaries: it has to be preloaded for the entire Python process.

Overall, this PR looks pretty good to me. Also cc'ing @zheng-xq to take a look.

@XiaobingSuper
Collaborator Author

@zheng-xq, can you help review it?

@XiaobingSuper
Collaborator Author

@bddppq, can you help review this code? Perhaps we can first merge this PR, which unifies the mkldnn convolution code. Thanks!

@bddppq
Contributor

bddppq commented Sep 6, 2019

@pytorchbot rebase this please

@XiaobingSuper XiaobingSuper force-pushed the mkldnn_conv_backward branch 2 times, most recently from 82ac1f3 to 704376d on September 10, 2019 01:56
@XiaobingSuper
Collaborator Author

@VitalyFedyunin, I rebased the code again; can you help review it when you have free time? Thanks!

@XiaobingSuper
Collaborator Author

@VitalyFedyunin, the failing test cases are not related to this PR. Thanks!

@XiaobingSuper
Collaborator Author

@VitalyFedyunin

@XiaobingSuper XiaobingSuper force-pushed the mkldnn_conv_backward branch 3 times, most recently from 86c3d42 to 69e75e9 on October 28, 2019 07:04
@Jianhui-Li

@dzhulgakov regarding jemalloc: we tried out jemalloc and tcmalloc, but both create extra dependencies and neither supports NUMA. TF exposes NUMA as a sub-device so users can do NUMA-aware memory allocation. The memory allocator makes a big difference when running training or offline inference with large batch sizes, since malloc of large memory chunks triggers clear_page overhead. We observed a +30% benefit on throughput. Are there any plans for PyTorch to support a similar capability? If not, do you think it would work for us to implement one and submit a PR?

@XiaobingSuper
Collaborator Author

This PR was reopened in #36121.

taka-sawada pushed a commit to fujitsu/pytorch that referenced this pull request Nov 9, 2020
refs:pytorch#14 feat: [v1.5.0][pytorch#20567] Add aten mkldnn conv2d backward operator.

See merge request postk_dl/pytorch!12
taka-sawada pushed a commit to fujitsu/pytorch that referenced this pull request Dec 2, 2020

Labels

module: autograd (Related to torch.autograd, and the autograd engine in general); module: mkldnn (Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration); open source; triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
