Conversation

@mingfeima
Collaborator

This PR aims at improving BERT performance on CPU by using the mkldnn inner product for nn.Linear().
The current logic is to use mkldnn only when the input tensor has mkldnn layout. This PR loosens that condition: mkldnn will be used for nn.Linear() whenever the input tensor has dense layout. The aten tensor is viewed in place by mkldnn without an additional memory copy.

  1. when input.dim() >= 3, it is viewed as a 2d tensor, e.g. [T, N, C] is treated as [TN, C];
  2. when input is not contiguous, it is copied to make it contiguous, since the mkldnn inner product can't handle non-contiguous memory (see the sketch after this list).
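
As a hedged illustration of these two rules (arbitrary shapes, plain tensor code rather than the PR's C++ implementation):

```python
import torch

x = torch.randn(4, 2, 768)        # [T, N, C], so input.dim() >= 3
if not x.is_contiguous():
    x = x.contiguous()            # mkldnn inner product needs contiguous memory
x_2d = x.view(-1, x.size(-1))     # treated as [T*N, C]; a view, so no extra copy
print(x_2d.shape)                 # torch.Size([8, 768])
```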

With this PR, BERT glue/MRPC inference (batch size = 1) on a Xeon 6148 single socket (20 cores) improves by 44%:

  1. before (unit: iterations/sec):
408/408 [00:24<00:00, 16.69it/s]
  2. after (unit: iterations/sec):
408/408 [00:16<00:00, 24.06it/s]

The latency reduces from 59.92 ms to 41.56 ms correspondingly.

@pytorchbot pytorchbot added the module: mkldnn (Related to Intel IDEEP or oneDNN, a.k.a. mkldnn, integration) and module: operators labels Jun 17, 2019
@mingfeima
Collaborator Author

mingfeima commented Jun 17, 2019

This PR assumes that the model is converted to mkldnn as follows, so that the mkldnn weight is cached properly (a fuller usage sketch follows the snippet):

from torch.utils import mkldnn as mkldnn_utils
model = mkldnn_utils.to_mkldnn(model)
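
For context, a minimal hedged sketch of converting and running a model this way; the two-layer stand-in below is illustrative only, not the BERT benchmark itself:

```python
import torch
from torch.utils import mkldnn as mkldnn_utils

# A stand-in model; the real benchmark uses BERT.
model = torch.nn.Sequential(torch.nn.Linear(768, 3072), torch.nn.ReLU(),
                            torch.nn.Linear(3072, 768)).eval()
model = mkldnn_utils.to_mkldnn(model)    # nn.Linear weights are cached in mkldnn layout

with torch.no_grad():
    out = model(torch.randn(1, 768))     # dense (plain layout) input is accepted with this PR
```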

I forked BERT to create the benchmark; use the mkldnn_linear branch from this link to reproduce the results from this PR:

  1. prepare the dataset according to link.
  2. update GLUE_DIR in run_inference.sh to the actual dataset path.
  3. test the original performance: bash run_inference.sh.
  4. test the mkldnn linear performance: bash run_inference.sh --mkldnn.

@mingfeima
Collaborator Author

TODOs:
The output from ideep is not produced in place, which triggers a memory copy in mkldnn_to_dense. I will update the ideep interface to remove this copy, but the expected additional improvement is only marginal.

@umanwizard
Contributor

@dzhulgakov can you review this?

@umanwizard umanwizard added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jun 17, 2019
@dzhulgakov dzhulgakov requested a review from bddppq June 17, 2019 23:47
@dzhulgakov
Collaborator

@bddppq is the mkl master

Collaborator

@dzhulgakov dzhulgakov left a comment

so, you're trying to target only an explicitly converted model, right? we could also just prepend .to_mkldnn() somewhere in code before the model - would it be sufficient?

Contributor

@bddppq bddppq left a comment

@mingfeima Could you clarify where the perf gain comes from? Is it:

  1. mkldnn_linear being faster than pytorch's default cpu linear, or
  2. with mkldnn, some other ops in BERT running faster than pytorch's cpu implementations (but since our workflow currently only supports running the whole model with mkldnn, you need to loosen the linear layer to accept a cpu dense tensor as input to work around this limitation)?

Also, when you compared the perf, did you build pytorch with MKL as BLAS (i.e. BLAS=MKL on the build command line)? pytorch's cpu linear implementation (your baseline) is faster with it.

cc @BIT-silence @llyfacebook to repro the perf improvement

@Jianhui-Li

@mingfeima @bddppq

Mingfei's data is measured on a specific configuration where MKL-DNN has better performance than MKL. Let me communicate with the MKL-DNN team about how we should proceed: whether we should replace MKL-DNN with MKL for all shapes/configurations, or leave it as a user option.

This PR loosens the condition on the input format so that it takes the dense input format, which integrates MKL-DNN into the CPU path. I think it is better for us to stick to the principle that MKL-DNN integration stays on the MKL-DNN path.

@mingfeima
Collaborator Author

Thanks for the review. Sorry for the late response, I had a family emergency to handle last week.

so, you're trying to target only an explicitly converted model, right? we could also just prepend .to_mkldnn() somewhere in code before the model - would it be sufficient?

@dzhulgakov Yes, this only targets an explicitly converted model, and I think that aligns with the current design. The model conversion serves as a switch for whether you want to use mkldnn or not, which allows you to select the best config flexibly (explained in the section below).

Only prepending input.to_mkldnn() is not sufficient, as the model conversion is used for weight caching at inference. Also, input.to_mkldnn() triggers an additional memory copy, which is unnecessary here: BERT has no convolution, so nn.Linear works on plain layout rather than mkldnn layout, and an in-place (no memory copy) view between at::Tensor and ideep::tensor is enough.
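
As a small hedged illustration of the copy being avoided (arbitrary shape; this is plain tensor code, not the PR's implementation): calling to_mkldnn() on a dense tensor materializes a new mkldnn-layout buffer, which is exactly the copy the in-place view skips.

```python
import torch

x = torch.randn(32, 768)           # plain (dense) layout, as produced by BERT's other ops
x_mkldnn = x.to_mkldnn()           # allocates an mkldnn-layout buffer and copies the data
print(x.layout, x_mkldnn.layout)   # torch.strided torch._mkldnn
```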

@mingfeima
Collaborator Author

mingfeima commented Jun 24, 2019

@mingfeima Could you clarify where does the perf gain come from? Is it:

@bddppq First of all, I always compile with MKL... The MKL version here is 2019.4.

The performance gain comes from the fact that for some configs the MKL-DNN sgemm is faster than MKL's.

BERT has 3 major GEMM sizes: input_channel/output_channel = (768, 768), (768, 3072), (3072, 768).
For the last two, mkldnn is faster; for the first one, mkl is faster. You can find the details in this gist.

GEMM performance is a huge topic, and the current status is that for some sizes mkldnn is faster and for others mkl is faster.
From Intel's perspective, in case mkldnn has worse performance for some specific size, the mkldnn team will try to fix it.
But this PR makes it possible to easily select the faster implementation for each particular GEMM size even without Intel's support. You just have to:

  1. collect the hotspot gemm sizes;
  2. write a benchmark to determine which is faster for a particular gemm, mkldnn or mkl (a rough sketch follows this list);
  3. in case using mkldnn is beneficial, convert that module, e.g. mod = nn.Linear(ic, oc); mod = mkldnn_utils.to_mkldnn(mod); if not, leave it alone and mkl will be used.
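
A rough, hedged sketch of step 2, assuming PyTorch is built with MKL as BLAS and has MKL-DNN available; the batch size and timing loop are illustrative only, not the benchmark behind the numbers in this thread:

```python
import time
import torch
from torch.utils import mkldnn as mkldnn_utils

def bench(mod, x, iters=200):
    # average wall-clock time per call, in milliseconds
    with torch.no_grad():
        for _ in range(20):                      # warm-up
            mod(x)
        t0 = time.time()
        for _ in range(iters):
            mod(x)
    return (time.time() - t0) / iters * 1e3

for ic, oc in [(768, 768), (768, 3072), (3072, 768)]:    # BERT's hotspot GEMM sizes
    x = torch.randn(128, ic)                             # arbitrary batch size
    mkl_mod = torch.nn.Linear(ic, oc).eval()             # default dense path (MKL sgemm)
    mkldnn_mod = mkldnn_utils.to_mkldnn(torch.nn.Linear(ic, oc).eval())
    print(f"({ic}, {oc}) mkl: {bench(mkl_mod, x):.3f} ms, "
          f"mkldnn: {bench(mkldnn_mod, x):.3f} ms")
```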

@mingfeima
Collaborator Author

@dzhulgakov @BIT-silence @bddppq This PR has been updated according to your feedback, please review.

Collaborator

@dzhulgakov dzhulgakov left a comment

I think it's good to go (modulo minor comments)

Ideally, we'd have the MKLDNN implementation on par with the MKL one so that we don't need a user-visible conversion. But this PR is ok as it enhances the separate mkldnn-specific API.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@dzhulgakov has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mingfeima
Collaborator Author

mingfeima commented Jul 4, 2019

Just moved the mkldnn/cpu tensor conversion logic from native/mkldnn/Linear.cpp to torch/utils/mkldnn.py.

The conversion is out-of-place at the moment and introduces some performance downgrade due to the additional memory copy (a sketch of such an out-of-place wrapper follows the numbers below):

Single socket results (20 cores) on Xeon 6148 (unit: iterations/sec):

  1. mkl (original):
408/408 [00:24<00:00, 16.69it/s]
  2. mkldnn, in-place conversion, no memcpy:
408/408 [00:16<00:00, 24.06it/s]
  3. mkldnn, out-of-place conversion, with memcpy:
408/408 [00:18<00:00, 21.95it/s]
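
For reference, a hedged sketch of what such an out-of-place Python-side wrapper looks like (not necessarily the PR's exact code); the to_mkldnn()/to_dense() hops are the extra memcpy behind the gap between 24.06 it/s and 21.95 it/s. It assumes the mkldnn_linear native op is exposed under torch._C._nn:

```python
import torch

def linear_with_outplace_conversion(x, mkldnn_weight, mkldnn_bias):
    # convert dense input to mkldnn layout: copy #1
    x_mkldnn = x if x.is_mkldnn else x.to_mkldnn()
    y_mkldnn = torch._C._nn.mkldnn_linear(x_mkldnn, mkldnn_weight, mkldnn_bias)
    # convert the mkldnn output back to dense: copy #2
    return y_mkldnn if x.is_mkldnn else y_mkldnn.to_dense()
```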

@mingfeima
Collaborator Author

mingfeima commented Jul 4, 2019

UPDATE: multi-instance results on Xeon 6148, using 20 threads with 1 core per thread.
The benchmark has been rewritten with ThroughputBenchmark; use run_inference.sh to reproduce.

NB: the script sets num_calling_threads of ThroughputBenchmark to 20 and sets OMP_NUM_THREADS=1; let me know if this is not your original intention (a hedged sketch of the setup follows).
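
For readers unfamiliar with it, a hedged sketch of this multi-instance setup using torch.utils.ThroughputBenchmark; model and example_input are placeholders for the benchmarked module and its input:

```python
import torch
from torch.utils import ThroughputBenchmark

torch.set_num_threads(1)            # one intra-op thread per instance, as with OMP_NUM_THREADS=1

bench = ThroughputBenchmark(model)  # `model` is the (optionally mkldnn-converted) module
bench.add_input(example_input)      # register an example input for the benchmark to draw from
stats = bench.benchmark(
    num_calling_threads=20,         # one calling thread per physical core of the socket
    num_warmup_iters=100,
    num_iters=1000,
)
print(stats)                        # reports average latency and iterations/sec across threads
```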

  1. before (mkl sgemm)
>>> ./run_inference.sh --multi_instances
Average latency per example: 469.058ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 42.64
Total time: 23.453s
  2. after (mkldnn sgemm)
>>> ./run_inference.sh --multi_instances --mkldnn
Average latency per example: 370.495ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 53.98
Total time: 18.525s

Collaborator

@dzhulgakov dzhulgakov left a comment

Looks pretty clean to me!

ideep::inner_product_forward::compute(x, w, y);
}

auto input_size = self.sizes();
Collaborator

nit: move it inside the "if" below as it's used only there

Contributor

@facebook-github-bot facebook-github-bot left a comment

@dzhulgakov has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@zhangguanheng66
Contributor

@dzhulgakov Could you land this PR? Thanks.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 19, 2019
…put is dense layout (#21851)

Summary:
This PR aims at improving BERT performance on CPU by using `mkldnn` inner product for `nn.Linear()`.
The current logic is to use `mkldnn` only when `input` tensor is of mkldnn layout. This PR loosens this condition, `mkldnn` will be used for `nn.Linear()` when `input` tensor is of dense layout. The aten tensor is viewed inplace in `mkldnn` without additional memory copy.
1. when `input.dim() >= 3` , it is viewed as 2d tensor. e.g. `[T, N, C]` is treated as `[TN, C]`;
2. when `input` is not contiguous, it is copied so as to be contiguous. `mkldnn` inner product can't handle non-contiguous memory.

With this PR, BERT on `glue/MRPC` inference (batch size = 1) on Xeon 6148 single socket (20 cores) improves by `44%`:

1. before (unit: iterations/sec):
```bash
408/408 [00:24<00:00, 16.69it/s]
```
2. after (unit: iterations/sec):
```bash
408/408 [00:16<00:00, 24.06it/s]
```

The latency reduces from `59.92 ms` to `41.56 ms` correspondingly.
Pull Request resolved: pytorch/pytorch#21851

Differential Revision: D16056334

Pulled By: dzhulgakov

fbshipit-source-id: 9b70ed58323b5e2f3f4e3ebacc766a74a8b68a8a
@facebook-github-bot
Contributor

@dzhulgakov merged this pull request in 25f0dc3.
