BERT CPU performance optimization: use mkldnn for nn.Linear() when input is dense layout #21851
Conversation
This PR assumes the model is explicitly converted with `from torch.utils import mkldnn as mkldnn_utils; model = mkldnn_utils.to_mkldnn(model)`. I forked BERT to create the benchmark.
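(For reference, a minimal sketch of the conversion flow described above, assuming this PR is applied; the toy model and sizes are illustrative, not the benchmark fork.)

```python
import torch
from torch.utils import mkldnn as mkldnn_utils

# Stand-in model; the actual benchmark uses a forked BERT implementation.
model = torch.nn.Linear(768, 768).eval()

# Explicit conversion: weights are packed into MKL-DNN layout ahead of time.
model = mkldnn_utils.to_mkldnn(model)

# With this PR, the converted linear accepts a plain dense, contiguous CPU
# tensor directly; the caller does not need to convert the input to mkldnn layout.
x = torch.randn(128, 768)
with torch.no_grad():
    y = model(x)
```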
TODOs:
@dzhulgakov can you review this?
@bddppq is the mkl master.
dzhulgakov
left a comment
So, you're trying to target only an explicitly converted model, right? We could also just prepend `.to_mkldnn()` somewhere in the code before the model; would that be sufficient?
bddppq
left a comment
@mingfeima Could you clarify where the perf gain comes from? Is it:
- mkldnn_linear being faster than pytorch's default cpu linear, or
- with mkldnn, some other ops in BERT running faster than pytorch's cpu implementation (but since our workflow currently only supports running the whole model with mkldnn, you need to loosen the linear layer to accept a cpu dense tensor as input to work around this limitation)?
Also, when you compared the perf, did you build pytorch with mkl as blas (i.e. `BLAS=MKL` in the build command line)? pytorch's cpu linear implementation (so your baseline) is faster with this.
cc @BIT-silence @llyfacebook to repro the perf improvement
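(For reference, one quick way to check the build configuration mentioned above, using standard PyTorch introspection calls; nothing here is added by this PR.)

```python
import torch

# Prints the compile-time configuration, including the BLAS backend
# (look for a BLAS entry mentioning mkl) and whether MKL-DNN was enabled.
print(torch.__config__.show())

# Runtime availability checks for the two backends discussed above.
print("MKL available:    ", torch.backends.mkl.is_available())
print("MKL-DNN available:", torch.backends.mkldnn.is_available())
```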
Mingfei's data is measured on a specific configuration where MKL-DNN has better performance than MKL. Let me communicate with the MKL-DNN team about how we should proceed: whether we should replace MKL-DNN with MKL for all shapes/configurations, or leave it as a user option.

This PR loosens the condition on the input format so that it takes the dense input format, which makes MKL-DNN integrated into the CPU path. I think it is better for us to stick to the principle that MKL-DNN integration stays on the MKL-DNN path.
Thanks for the review. Sorry for the late response, I had a family emergency to handle last week.

@dzhulgakov Yes, this only targets an explicitly converted model, and I think that aligns with the current design. The model conversion serves as a switch for whether you want to use mkldnn.

@bddppq First of all, I always compile with MKL. The performance gain comes from the fact that for some configs MKLDNN sgemm is faster than MKL's. BERT has 3 major GEMM sizes: input_channel/output_channel = (768, 768), (768, 3072), (3072, 768). GEMM performance is a huge topic, and the current status is that for some sizes mkldnn is faster and for others mkl is faster.
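(A rough sketch of how one might time the three linear shapes mentioned above on a dense CPU input vs. the mkldnn path, assuming this PR is applied; the batch size and timing loop are arbitrary and not the benchmark used in this PR.)

```python
import time
import torch
from torch.utils import mkldnn as mkldnn_utils

def bench(linear, x, iters=100):
    # Simple wall-clock timing: warm up once, then average over iters runs.
    with torch.no_grad():
        linear(x)
        start = time.time()
        for _ in range(iters):
            linear(x)
    return (time.time() - start) / iters * 1e3  # ms per call

# The three dominant BERT GEMM shapes called out above.
for in_c, out_c in [(768, 768), (768, 3072), (3072, 768)]:
    x = torch.randn(128, in_c)
    fc = torch.nn.Linear(in_c, out_c).eval()
    t_default = bench(fc, x)                         # default CPU path (MKL sgemm if built with BLAS=MKL)
    t_mkldnn = bench(mkldnn_utils.to_mkldnn(fc), x)  # MKL-DNN inner product on a dense input (with this PR)
    print(f"({in_c}, {out_c}): default {t_default:.3f} ms, mkldnn {t_mkldnn:.3f} ms")
```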
@dzhulgakov @BIT-silence @bddppq This PR has been updated according to your feedback, please review.
dzhulgakov
left a comment
I think it's good to go (modulo minor comments)
Ideally, we'd have the MKLDNN implementation on par with the MKL one so that we don't need a user-visible conversion. But this PR is ok as it enhances the separate mkldnn-specific API.
facebook-github-bot
left a comment
@dzhulgakov has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Just moved the conversion. Single socket result (20 cores) on Xeon 6148 (unit: iterations/sec):
408/408 [00:24<00:00, 16.69it/s]
408/408 [00:16<00:00, 24.06it/s]
408/408 [00:18<00:00, 21.95it/s]
UPDATES: multiple-instance result:
>>> ./run_inference.sh --multi_instances
Average latency per example: 469.058ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 42.64
Total time: 23.453s
>>> ./run_inference.sh --multi_instances --mkldnn
Average latency per example: 370.495ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 53.98
Total time: 18.525s
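(A quick sanity check of how the reported numbers relate to each other; this is pure arithmetic on the figures above, and the instance count is inferred rather than stated in the thread.)

```python
# Baseline multi-instance run reported above.
iters, total_s, latency_ms = 1000, 23.453, 469.058
throughput = iters / total_s                  # ~42.64 it/s, matching the report
concurrency = throughput * latency_ms / 1000  # ~20 concurrent streams (inferred)
print(f"{throughput:.2f} it/s, ~{concurrency:.0f} concurrent instances")

# Same check for the --mkldnn run: 1000 / 18.525 s = 53.98 it/s,
# and 53.98 * 0.370495 s is again roughly 20.
```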
dzhulgakov
left a comment
Looks pretty clean to me!
ideep::inner_product_forward::compute(x, w, y);
}

auto input_size = self.sizes();

nit: move it inside the "if" below as it's used only there
facebook-github-bot
left a comment
@dzhulgakov has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@dzhulgakov Could you land this PR? Thanks.
BERT CPU performance optimization: use mkldnn for nn.Linear() when input is dense layout (#21851)

Summary:
This PR aims at improving BERT performance on CPU by using `mkldnn` inner product for `nn.Linear()`. The current logic is to use `mkldnn` only when the `input` tensor is of mkldnn layout. This PR loosens this condition: `mkldnn` will be used for `nn.Linear()` when the `input` tensor is of dense layout. The aten tensor is viewed in place in `mkldnn` without an additional memory copy.
1. When `input.dim() >= 3`, it is viewed as a 2d tensor, e.g. `[T, N, C]` is treated as `[TN, C]`.
2. When `input` is not contiguous, it is copied so as to be contiguous, since `mkldnn` inner product can't handle non-contiguous memory.

With this PR, BERT `glue/MRPC` inference (batch size = 1) on a Xeon 6148 single socket (20 cores) improves by `44%`:
1. before (unit: iterations/sec):
```bash
408/408 [00:24<00:00, 16.69it/s]
```
2. after (unit: iterations/sec):
```bash
408/408 [00:16<00:00, 24.06it/s]
```
The latency reduces from `59.92 ms` to `41.56 ms` correspondingly.

Pull Request resolved: pytorch/pytorch#21851
Differential Revision: D16056334
Pulled By: dzhulgakov
fbshipit-source-id: 9b70ed58323b5e2f3f4e3ebacc766a74a8b68a8a
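(As a rough illustration of the two input-handling rules in the summary above: the helper below mirrors the described behavior in Python rather than the actual ATen code, and the shapes are only examples.)

```python
import torch

def prepare_for_mkldnn_linear(x):
    # Rule 2 from the summary: mkldnn inner product can't handle
    # non-contiguous memory, so such inputs are copied to be contiguous.
    if not x.is_contiguous():
        x = x.contiguous()
    # Rule 1: an input with dim() >= 3, e.g. [T, N, C], is viewed as a
    # 2d tensor [T*N, C] for the GEMM; the view itself copies no data.
    if x.dim() >= 3:
        x = x.view(-1, x.size(-1))
    return x

x = torch.randn(128, 1, 768)               # [T, N, C], batch size 1 as in the benchmark
print(prepare_for_mkldnn_linear(x).shape)  # torch.Size([128, 768])

# Reported speedup: 1 / 16.69 it/s ~ 59.9 ms per iteration before,
# 1 / 24.06 it/s ~ 41.6 ms after, i.e. roughly 44% more iterations/sec.
```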
@dzhulgakov merged this pull request in 25f0dc3.