Conversation

@mingfeima
Collaborator

This PR aims at improving BERT performance on CPU by using the mkldnn inner product for nn.Linear().
The current logic is to use mkldnn only when the input tensor has mkldnn layout. This PR loosens that condition: mkldnn will be used for nn.Linear() whenever the input tensor has dense layout. The aten tensor is viewed in place by mkldnn without an additional memory copy.

  1. when input.dim() >= 3, it is viewed as a 2d tensor, e.g. [T, N, C] is treated as [TN, C];
  2. when input is not contiguous, it is copied to make it contiguous, since the mkldnn inner product can't handle non-contiguous memory (see the sketch after this list).
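
As a hedged illustration of these two rules (arbitrary shapes, plain tensor code rather than the PR's C++ implementation):

```python
import torch

x = torch.randn(4, 2, 768)        # [T, N, C], so input.dim() >= 3
if not x.is_contiguous():
    x = x.contiguous()            # mkldnn inner product needs contiguous memory
x_2d = x.view(-1, x.size(-1))     # treated as [T*N, C]; a view, so no extra copy
print(x_2d.shape)                 # torch.Size([8, 768])
```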

With this PR, BERT glue/MRPC inference (batch size = 1) on a Xeon 6148 single socket (20 cores) improves by 44%:

  1. before (unit: iterations/sec):
408/408 [00:24<00:00, 16.69it/s]
  2. after (unit: iterations/sec):
408/408 [00:16<00:00, 24.06it/s]

The latency reduces from 59.92 ms to 41.56 ms correspondingly.

@pytorchbot pytorchbot added the module: mkldnn (Related to Intel IDEEP or oneDNN, a.k.a. mkldnn, integration) and module: operators labels Jun 17, 2019
@mingfeima
Collaborator Author

mingfeima commented Jun 17, 2019

This PR assumes that the model is converted to mkldnn as follows, so that the mkldnn weight is cached properly (a fuller usage sketch follows the snippet):

from torch.utils import mkldnn as mkldnn_utils
model = mkldnn_utils.to_mkldnn(model)
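
For context, a minimal hedged sketch of converting and running a model this way; the two-layer stand-in below is illustrative only, not the BERT benchmark itself:

```python
import torch
from torch.utils import mkldnn as mkldnn_utils

# A stand-in model; the real benchmark uses BERT.
model = torch.nn.Sequential(torch.nn.Linear(768, 3072), torch.nn.ReLU(),
                            torch.nn.Linear(3072, 768)).eval()
model = mkldnn_utils.to_mkldnn(model)    # nn.Linear weights are cached in mkldnn layout

with torch.no_grad():
    out = model(torch.randn(1, 768))     # dense (plain layout) input is accepted with this PR
```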

I forked BERT to create the benchmark; use the mkldnn_linear branch from this link to reproduce the results from this PR:

  1. prepare the dataset according to link.
  2. update GLUE_DIR in run_inference.sh to the actual dataset path.
  3. test the original performance: bash run_inference.sh.
  4. test the mkldnn linear performance: bash run_inference.sh --mkldnn.

@mingfeima
Collaborator Author

TODOs:
The output from ideep is not produced in place, which triggers a memory copy in mkldnn_to_dense. I will update the ideep interface to remove this copy, but the expected additional improvement is only marginal.

@umanwizard
Contributor

@dzhulgakov can you review this?

@umanwizard umanwizard added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jun 17, 2019
@dzhulgakov dzhulgakov requested a review from bddppq June 17, 2019 23:47
@dzhulgakov
Collaborator

@bddppq is the mkl master

Collaborator

@dzhulgakov dzhulgakov left a comment

so, you're trying to target only an explicitly converted model, right? we could also just prepend .to_mkldnn() somewhere in code before the model - would it be sufficient?

Contributor

@bddppq bddppq left a comment

@mingfeima Could you clarify where the perf gain comes from? Is it:

  1. mkldnn_linear being faster than pytorch's default cpu linear, or
  2. with mkldnn, some other ops in BERT running faster than pytorch's cpu implementations (but since our workflow currently only supports running the whole model with mkldnn, you need to loosen the linear layer to accept a cpu dense tensor as input to work around this limitation)?

Also, when you compared the perf, did you build pytorch with MKL as BLAS (i.e. BLAS=MKL on the build command line)? pytorch's cpu linear implementation (your baseline) is faster with it.

cc @BIT-silence @llyfacebook to repro the perf improvement

@Jianhui-Li

@mingfeima @bddppq

Mingfei's data is measured on a specific configuration where MKL-DNN has better performance than MKL. Let me communicate with the MKL-DNN team about how we should proceed: whether we should replace MKL-DNN with MKL for all shapes/configurations, or leave it as a user option.

This PR loosens the condition on the input format so that it takes the dense input format, which integrates MKL-DNN into the CPU path. I think it is better for us to stick to the principle that MKL-DNN integration stays on the MKL-DNN path.

@mingfeima
Collaborator Author

Thanks for the review. Sorry for the late response, I had a family emergency to handle last week.

so, you're trying to target only an explicitly converted model, right? we could also just prepend .to_mkldnn() somewhere in code before the model - would it be sufficient?

@dzhulgakov Yes, this only targets an explicitly converted model, and I think that aligns with the current design. The model conversion serves as a switch for whether you want to use mkldnn or not, which allows you to select the best config flexibly (explained in the section below).

Only prepending input.to_mkldnn() is not sufficient, as the model conversion is used for weight caching at inference. Also, input.to_mkldnn() triggers an additional memory copy, which is unnecessary here: BERT has no convolution, so nn.Linear works on plain layout rather than mkldnn layout, and an in-place (no memory copy) view between at::Tensor and ideep::tensor is enough.
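
As a small hedged illustration of the copy being avoided (arbitrary shape; this is plain tensor code, not the PR's implementation): calling to_mkldnn() on a dense tensor materializes a new mkldnn-layout buffer, which is exactly the copy the in-place view skips.

```python
import torch

x = torch.randn(32, 768)           # plain (dense) layout, as produced by BERT's other ops
x_mkldnn = x.to_mkldnn()           # allocates an mkldnn-layout buffer and copies the data
print(x.layout, x_mkldnn.layout)   # torch.strided torch._mkldnn
```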

@mingfeima
Collaborator Author

mingfeima commented Jun 24, 2019

@mingfeima Could you clarify where does the perf gain come from? Is it:

@bddppq First of all, I always compile with MKL... The MKL version here is 2019.4.

The performance gain comes from the fact that for some configs the MKL-DNN sgemm is faster than MKL's.

BERT has 3 major GEMM sizes: input_channel/output_channel = (768, 768), (768, 3072), (3072, 768).
For the last two, mkldnn is faster; for the first one, mkl is faster. You can find the details in this gist.

GEMM performance is a huge topic, and the current status is that for some sizes mkldnn is faster and for others mkl is faster.
From Intel's perspective, in case mkldnn has worse performance for some specific size, the mkldnn team will try to fix it.
But this PR makes it possible to easily select the faster implementation for each particular GEMM size even without Intel's support. You just have to:

  1. collect the hotspot gemm sizes;
  2. write a benchmark to determine which is faster for a particular gemm, mkldnn or mkl (a rough sketch follows this list);
  3. in case using mkldnn is beneficial, convert that module, e.g. mod = nn.Linear(ic, oc); mod = mkldnn_utils.to_mkldnn(mod); if not, leave it alone and mkl will be used.
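
A rough, hedged sketch of step 2, assuming PyTorch is built with MKL as BLAS and has MKL-DNN available; the batch size and timing loop are illustrative only, not the benchmark behind the numbers in this thread:

```python
import time
import torch
from torch.utils import mkldnn as mkldnn_utils

def bench(mod, x, iters=200):
    # average wall-clock time per call, in milliseconds
    with torch.no_grad():
        for _ in range(20):                      # warm-up
            mod(x)
        t0 = time.time()
        for _ in range(iters):
            mod(x)
    return (time.time() - t0) / iters * 1e3

for ic, oc in [(768, 768), (768, 3072), (3072, 768)]:    # BERT's hotspot GEMM sizes
    x = torch.randn(128, ic)                             # arbitrary batch size
    mkl_mod = torch.nn.Linear(ic, oc).eval()             # default dense path (MKL sgemm)
    mkldnn_mod = mkldnn_utils.to_mkldnn(torch.nn.Linear(ic, oc).eval())
    print(f"({ic}, {oc}) mkl: {bench(mkl_mod, x):.3f} ms, "
          f"mkldnn: {bench(mkldnn_mod, x):.3f} ms")
```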

@mingfeima
Collaborator Author

@dzhulgakov @BIT-silence @bddppq This PR has been updated according to your feedback, please review.

Collaborator

@dzhulgakov dzhulgakov left a comment

I think it's good to go (modulo minor comments)

Ideally, we'd have the MKLDNN implementation on par with the MKL one so that we don't need a user-visible conversion. But this PR is ok as it enhances the separate mkldnn-specific API.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@dzhulgakov has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mingfeima
Collaborator Author

mingfeima commented Jul 4, 2019

Just moved the mkldnn/cpu tensor conversion logic from native/mkldnn/Linear.cpp to torch/utils/mkldnn.py.

The conversion is out-of-place at the moment and introduces some performance downgrade due to the additional memory copy (a sketch of such an out-of-place wrapper follows the numbers below):

Single socket results (20 cores) on Xeon 6148 (unit: iterations/sec):

  1. mkl (original):
408/408 [00:24<00:00, 16.69it/s]
  2. mkldnn, in-place conversion, no memcpy:
408/408 [00:16<00:00, 24.06it/s]
  3. mkldnn, out-of-place conversion, with memcpy:
408/408 [00:18<00:00, 21.95it/s]
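
For reference, a hedged sketch of what such an out-of-place Python-side wrapper looks like (not necessarily the PR's exact code); the to_mkldnn()/to_dense() hops are the extra memcpy behind the gap between 24.06 it/s and 21.95 it/s. It assumes the mkldnn_linear native op is exposed under torch._C._nn:

```python
import torch

def linear_with_outplace_conversion(x, mkldnn_weight, mkldnn_bias):
    # convert dense input to mkldnn layout: copy #1
    x_mkldnn = x if x.is_mkldnn else x.to_mkldnn()
    y_mkldnn = torch._C._nn.mkldnn_linear(x_mkldnn, mkldnn_weight, mkldnn_bias)
    # convert the mkldnn output back to dense: copy #2
    return y_mkldnn if x.is_mkldnn else y_mkldnn.to_dense()
```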

@mingfeima
Collaborator Author

mingfeima commented Jul 4, 2019

UPDATE: multi-instance results on Xeon 6148, using 20 threads with 1 core per thread.
The benchmark has been rewritten with ThroughputBenchmark; use run_inference.sh to reproduce.

NB: the script sets num_calling_threads of ThroughputBenchmark to 20 and sets OMP_NUM_THREADS=1; let me know if this is not your original intention (a hedged sketch of the setup follows).
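
For readers unfamiliar with it, a hedged sketch of this multi-instance setup using torch.utils.ThroughputBenchmark; model and example_input are placeholders for the benchmarked module and its input:

```python
import torch
from torch.utils import ThroughputBenchmark

torch.set_num_threads(1)            # one intra-op thread per instance, as with OMP_NUM_THREADS=1

bench = ThroughputBenchmark(model)  # `model` is the (optionally mkldnn-converted) module
bench.add_input(example_input)      # register an example input for the benchmark to draw from
stats = bench.benchmark(
    num_calling_threads=20,         # one calling thread per physical core of the socket
    num_warmup_iters=100,
    num_iters=1000,
)
print(stats)                        # reports average latency and iterations/sec across threads
```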

  1. before (mkl sgemm)
>>> ./run_inference.sh --multi_instances
Average latency per example: 469.058ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 42.64
Total time: 23.453s
  2. after (mkldnn sgemm)
>>> ./run_inference.sh --multi_instances --mkldnn
Average latency per example: 370.495ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 53.98
Total time: 18.525s

Collaborator

@dzhulgakov dzhulgakov left a comment

Looks pretty clean to me!

ideep::inner_product_forward::compute(x, w, y);
}

auto input_size = self.sizes();
Collaborator

nit: move it inside the "if" below as it's used only there

Contributor

@facebook-github-bot facebook-github-bot left a comment

@dzhulgakov has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@zhangguanheng66
Contributor

@dzhulgakov Could you land this PR? Thanks.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 19, 2019
…put is dense layout (#21851)

Summary:
This PR aims at improving BERT performance on CPU by using `mkldnn` inner product for `nn.Linear()`.
The current logic is to use `mkldnn` only when `input` tensor is of mkldnn layout. This PR loosens this condition, `mkldnn` will be used for `nn.Linear()` when `input` tensor is of dense layout. The aten tensor is viewed inplace in `mkldnn` without additional memory copy.
1. when `input.dim() >= 3` , it is viewed as 2d tensor. e.g. `[T, N, C]` is treated as `[TN, C]`;
2. when `input` is not contiguous, it is copied so as to be contiguous. `mkldnn` inner product can't handle non-contiguous memory.

With this PR, BERT on `glue/MRPC` inference (batch size = 1) on Xeon 6148 single socket (20 cores) improves by `44%`:

1. before (unit: iterations/sec):
```bash
408/408 [00:24<00:00, 16.69it/s]
```
2. after (unit: iterations/sec):
```bash
408/408 [00:16<00:00, 24.06it/s]
```

The latency reduces from `59.92 ms` to `41.56 ms` correspondingly.
Pull Request resolved: pytorch/pytorch#21851

Differential Revision: D16056334

Pulled By: dzhulgakov

fbshipit-source-id: 9b70ed58323b5e2f3f4e3ebacc766a74a8b68a8a
@facebook-github-bot
Contributor

@dzhulgakov merged this pull request in 25f0dc3.
