Conversation

@XiaobingSuper
Collaborator

@XiaobingSuper XiaobingSuper commented May 16, 2019

mkldnn backward ops list:

Enable mkldnn backward ops, which can improve training performance by about 2x for the resnext101 model.
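As a rough illustration of the goal, here is a hypothetical usage sketch (not code from this PR; it assumes an MKL-DNN-enabled build in which the backward ops from this series are available), where the backward pass dispatches to mkldnn kernels instead of falling back to the native CPU path:

```python
import torch

# Hypothetical sketch: forward through an mkldnn op, then backward.
x = torch.randn(8, 32, requires_grad=True)
y = torch.relu(x.to_mkldnn())    # forward dispatches to the mkldnn relu kernel
y.to_dense().sum().backward()    # backward can now also run on the mkldnn path
print(x.grad.shape)              # torch.Size([8, 32])
```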

@mingfeima
Collaborator

Hi @bddppq, the backward integration is done; please review the code.
Feel free to invite more reviewers.

@XiaobingSuper XiaobingSuper mentioned this pull request May 16, 2019 (6 tasks)
@XiaobingSuper
Collaborator Author

cc @uyongw, @Jianhui-Li, @jgong5.

@li-roy li-roy added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 16, 2019
@XiaobingSuper XiaobingSuper force-pushed the mkldnn_conv_backward branch from a6870f6 to 35ef05b on June 13, 2019 04:46
facebook-github-bot pushed a commit that referenced this pull request Jun 13, 2019
Summary:
### mkldnn backward ops list:
 - [ ] \(#20567) Add aten mkldnn conv2d backward operator 💛
 - [ ] \(#20570) Add aten mkldnn backward ops: relu, linear and reshape 💛
 - [ ] \(#20571) Add aten mkldnn backward ops: max_pool2d, avg_pool2d and adaptive_avg_pool2d 💛
 - [ ] \(#20572) Add aten mkldnn batchnorm backward operator 💛
 - [ ] \(#20573) Add aten mkldnn zero_ operator 💛
 - [ ] \(#20575) Add mkldnn mul operator 💛
Pull Request resolved: #20575

Differential Revision: D15799529

Pulled By: bddppq

fbshipit-source-id: 4887d8ef1a0e316ad9db199b657d9481fc13e486
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 13, 2019
@gottbrath gottbrath requested review from bddppq and dzhulgakov June 13, 2019 17:09
facebook-github-bot pushed a commit that referenced this pull request Jun 14, 2019
Summary:
### mkldnn backward ops list:
 - [ ] \(#20567) Add aten mkldnn conv2d backward operator 💛
 - [ ] \(#20570) Add aten mkldnn backward ops: relu, linear and reshape 💛
 - [ ] \(#20571) Add aten mkldnn backward ops: max_pool2d, avg_pool2d and adaptive_avg_pool2d 💛
 - [ ] \(#20572) Add aten mkldnn batchnorm backward operator 💛
 - [ ] \(#20573) Add aten mkldnn zero_ operator 💛
 - [ ] \(#20575) Add mkldnn mul operator 💚
Pull Request resolved: #20573

Differential Revision: D15820477

Pulled By: bddppq

fbshipit-source-id: 35d95f5b4e013c8db1911f52148550a2e40a2e68
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 14, 2019
@gottbrath gottbrath requested a review from gchanan June 25, 2019 16:22
@dzhulgakov
Collaborator

In general looks OK. Two questions:

  • CI is failing.
  • Did you benchmark the potential perf change from introducing ideep in the backward path? It should be OK, but best to double-check (see the timing sketch below).
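A minimal timing sketch of the kind of forward/backward breakdown being asked about (assumed setup in Python; the actual script behind the numbers posted later in this thread is not shown here):

```python
import time

import torch
import torchvision

# Hypothetical micro-benchmark: time forward and backward separately, the
# same breakdown reported in the logs below. Model and batch size here are
# assumptions, not taken from the thread's actual benchmark script.
model = torchvision.models.resnet50()
x = torch.randn(128, 3, 224, 224)

model(x)                                  # warm-up pass
t0 = time.time()
out = model(x)
fwd_ms = (time.time() - t0) * 1000        # forward time

loss = out.sum()
t0 = time.time()
loss.backward()
bwd_ms = (time.time() - t0) * 1000        # backward time
print(f"forward: {fwd_ms:.2f} ms, backward: {bwd_ms:.2f} ms")
```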

@mingfeima
Collaborator

@dzhulgakov so what is the proposed system config for training? Dual socket or single socket?
@XiaobingSuper, please post benchmark numbers here.

However, we have a known issue here: the memory buffer is not cached (output tensors need to be allocated every time), which is going to add significant overhead in the training scenario.

A customized memory allocator is a solution, but it seems to go against the current design. Suggestions?

@jgong5
Collaborator

jgong5 commented Jul 16, 2019

> However, we have a known issue here: the memory buffer is not cached (output tensors need to be allocated every time), which is going to add significant overhead in the training scenario.
>
> A customized memory allocator is a solution, but it seems to go against the current design. Suggestions?

@dzhulgakov The performance issue @mingfeima pointed out is caused by clear_page overhead on large buffer allocations with malloc. This happens when PyTorch allocates large activation buffers, which is typical during training since the batch size is usually large. As you know, whether clear_page is triggered is controlled by M_MMAP_THRESHOLD (http://man7.org/linux/man-pages/man3/mallopt.3.html), which defaults to 128 KB; on 64-bit systems the upper bound is 32 MB, which is not a big number w.r.t. typical training workloads. I notice that PyTorch has a caching allocator in the GPU path, and the CPU path could probably use a similar idea to avoid calling malloc again and again for output tensors. Currently, we have a quick implementation of a caching allocator inside ideep, and @XiaobingSuper will share the training performance numbers with and without the caching allocator FYI.
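For illustration, here is a minimal caching-allocator sketch (illustrative Python only; it is not the ideep implementation mentioned above). Freed buffers go into per-size free lists and are reused, so large activation buffers avoid paying the mmap/clear_page cost on every iteration:

```python
import collections

import torch

class CachingAllocator:
    """Illustrative sketch: keep freed buffers in per-(shape, dtype) free
    lists and reuse them, instead of returning memory to malloc and paying
    the clear_page cost again on the next large allocation."""

    def __init__(self):
        self._free = collections.defaultdict(list)

    def alloc(self, shape, dtype=torch.float32):
        key = (tuple(shape), dtype)
        if self._free[key]:
            return self._free[key].pop()         # cache hit: reuse warm pages
        return torch.empty(shape, dtype=dtype)   # cache miss: real allocation

    def release(self, t):
        self._free[(tuple(t.shape), t.dtype)].append(t)

allocator = CachingAllocator()
buf = allocator.alloc((128, 64, 56, 56))   # e.g. a large activation buffer
allocator.release(buf)
assert allocator.alloc((128, 64, 56, 56)) is buf   # reused, not re-allocated
```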

@XiaobingSuper XiaobingSuper force-pushed the mkldnn_conv_backward branch from 35ef05b to d19eaab on July 16, 2019 02:48
@XiaobingSuper
Collaborator Author

XiaobingSuper commented Jul 16, 2019

Running this benchmark on an SKX-6148 with 2 sockets, we get about a 2x performance improvement using the mkldnn path compared to the native path:

| models | native path (img/s) | mkldnn path (img/s) | speedup |
| --- | --- | --- | --- |
| resnet50 | 28.27 | 67.81 | 239.87% |
| resnext101 | 8.73 | 19.71 | 225.77% |

The detailed logs follow:

  • Native path:
### using OMP_NUM_THREADS=40
### using KMP_AFFINITY=granularity=fine,compact,1,0
### using KMP_BLOCKTIME=1

Running on device:  Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Running on torch: 1.2.0a0+69e3229
Running on torchvision: 0.3.0a0+8837e0e

ModelType: resnet50, Kernels: nn Input shape: 128x3x224x224
nn                              :forward:    1803.12 (ms)      70.99 (imgs/s)
nn                             :backward:    2725.13 (ms)
nn                               :update:       4.72 (ms)
nn                                :total:    4528.25 (ms)      28.27 (imgs/s)
ModelType: resnext101_32x8d, Kernels: nn Input shape: 128x3x224x224
nn                              :forward:    6042.85 (ms)      21.18 (imgs/s)
nn                             :backward:    8624.28 (ms)
nn                               :update:      13.07 (ms)
nn                                :total:   14667.13 (ms)       8.73 (imgs/s)
  • MKLDNN path:
### using OMP_NUM_THREADS=40
### using KMP_AFFINITY=granularity=fine,compact,1,0
### using KMP_BLOCKTIME=1

Running on device:  Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Running on torch: 1.2.0a0+69e3229
Running on torchvision: 0.3.0a0+8837e0e

ModelType: resnet50, Kernels: nn Input shape: 128x3x224x224
nn                              :forward:     659.40 (ms)     194.12 (imgs/s)
nn                             :backward:    1228.16 (ms)
nn                               :update:       4.69 (ms)
nn                                :total:    1887.55 (ms)      67.81 (imgs/s)
ModelType: resnext101_32x8d, Kernels: nn Input shape: 128x3x224x224
nn                              :forward:    2259.24 (ms)      56.66 (imgs/s)
nn                             :backward:    4234.59 (ms)
nn                               :update:      13.00 (ms)
nn                                :total:    6493.83 (ms)      19.71 (imgs/s)

Next step, I will share the performance using caching allocator. Thanks!

@XiaobingSuper
Collaborator Author

Adding the performance numbers using the ideep caching allocator on an SKX-6148 with 2 sockets: there is at least a 1.36x performance improvement for inference with large batch size, and at least a 1.32x improvement for training.

1. For inference using MKLDNN with large batch size:

   | models | MKLDNN default allocator (img/s) | MKLDNN caching allocator (img/s) | speedup |
   | --- | --- | --- | --- |
   | resnet50 | 196.91 | 289.69 | 147.12% |
   | resnext101 | 58.23 | 79.77 | 136.99% |

2. For training using MKLDNN:

   | models | MKLDNN default allocator (img/s) | MKLDNN caching allocator (img/s) | speedup |
   | --- | --- | --- | --- |
   | resnet50 | 67.81 | 96.05 | 141.65% |
   | resnext101 | 19.71 | 26.2 | 132.93% |

@XiaobingSuper XiaobingSuper force-pushed the mkldnn_conv_backward branch from d19eaab to 1db67f7 on July 18, 2019 05:08
@dzhulgakov dzhulgakov requested a review from zheng-xq July 22, 2019 00:08
@dzhulgakov
Collaborator

Adding a CPU caching allocator is something we also discussed. The problem is mostly alleviated by using a better malloc, e.g. jemalloc (http://jemalloc.net/jemalloc.3.html). Unfortunately, I don't think there's a safe way to package it with prebuilt pytorch binaries: it has to be preloaded for the entire Python process.

Overall, this PR looks pretty good to me. Also cc'ing @zheng-xq to take a look.

@XiaobingSuper
Collaborator Author

@zheng-xq, can you help review it?

@XiaobingSuper
Collaborator Author

@bddppq, can you help review this code? Perhaps we can first merge this PR, which unifies the mkldnn convolution code. Thanks!

@bddppq
Contributor

bddppq commented Sep 6, 2019

@pytorchbot rebase this please

@XiaobingSuper XiaobingSuper force-pushed the mkldnn_conv_backward branch 2 times, most recently from 82ac1f3 to 704376d on September 10, 2019 01:56
@XiaobingSuper
Collaborator Author

@VitalyFedyunin, I rebased the code again; can you help review it when you have free time? Thanks!

@XiaobingSuper
Collaborator Author

@VitalyFedyunin, the failing test cases are not related to this PR. Thanks!

@XiaobingSuper
Collaborator Author

@VitalyFedyunin

@XiaobingSuper XiaobingSuper force-pushed the mkldnn_conv_backward branch 3 times, most recently from 86c3d42 to 69e75e9 on October 28, 2019 07:04
@Jianhui-Li

@dzhulgakov regarding jemalloc: we tried out jemalloc and tcmalloc, but both create extra dependencies and neither supports NUMA. TF exposes NUMA as a sub-device so users can do NUMA-aware memory allocation. The memory allocator makes a big difference when running training or offline inference with large batch sizes, since malloc of large memory chunks triggers clear_page overhead. We observed a +30% benefit on throughput. Are there any plans for PyTorch to support a similar capability? If not, do you think it would work for us to implement one and submit a PR?

@XiaobingSuper
Collaborator Author

This PR was reopened in #36121.

taka-sawada pushed a commit to fujitsu/pytorch that referenced this pull request Nov 9, 2020
refs:pytorch#14 feat: [v1.5.0][pytorch#20567] Add aten mkldnn conv2d backward operator.

See merge request postk_dl/pytorch!12
taka-sawada pushed a commit to fujitsu/pytorch that referenced this pull request Dec 2, 2020

Labels

module: autograd (Related to torch.autograd, and the autograd engine in general); module: mkldnn (Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration); open source; triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
