
Optimize some reduction operators on CPU BFloat16#55202

Closed
mingfeima wants to merge 14 commits intogh/mingfeima/16/basefrom
gh/mingfeima/16/head

Conversation

@mingfeima
Collaborator

@mingfeima mingfeima commented Apr 2, 2021

Stack from ghstack:

Differential Revision: D28836790

@facebook-github-bot
Contributor

facebook-github-bot commented Apr 2, 2021

💊 CI failures summary and remediations

As of commit f2c070c (more details on the Dr. CI page and at hud.pytorch.org/pr/55202):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



mingfeima added a commit that referenced this pull request Apr 2, 2021
ghstack-source-id: 15f6804
Pull Request resolved: #55202
@mingfeima
Collaborator Author

This PR aims at enabling or optimizing the following reduction operators on CPU BFloat16:

  • softmax (no BFloat16 support previously)
  • log_softmax
  • max (reduce_all)
  • min (reduce_all)
  • _aminmax (reduce_all; no BFloat16 support previously)

Note that some operators, e.g. log_softmax, already had BFloat16 support previously, but the perf was not good.
This PR specializes all the map<>() variants from vec256/functional.h; corresponding test cases are added in vec256_test_all_types.cpp.

Since we have already specialized all members of Vec256&lt;scalar_t&gt; for {scalar_t=BFloat16} in vec256_bfloat16.h, the following example runs smoothly on BFloat16 (note the lambda must capture `one`):

```cpp
using Vec = Vec256<BFloat16>;
Vec one = Vec(BFloat16(1));
vec256::map([one](Vec x) { return one / (one + x.exp()); }, y_ptr, x_ptr, N);
```

The current impl ends up with 3 pairs of dtype conversions, one each for `.exp()`, `+`, and `/`.
The new impl only needs dtype conversions for the input and the output. Benefits:

  • better performance, since there are fewer dtype conversions;
  • less rounding error, since intermediate results are kept in fp32;
  • accumulation is done in fp32.

I am going to use this stack to push more BFloat16 CPU optimizations, following the same approach as proposed in this PR:

  • input load: bf16->fp32; output store: fp32->bf16
  • all intermediate operations (including accumulation) use fp32

@mingfeima
Collaborator Author

mingfeima commented Apr 2, 2021

Since this PR is not related to the parallelization feature, only single-core perf is tested.
NB: if an operator didn't have BFloat16 support previously, the "before" perf refers to a simple impl like the following:

```diff
-   AT_DISPATCH_ALL_TYPES(input.scalar_type(), "_aminmax_all_all", [&] {
+   AT_DISPATCH_ALL_TYPES_AND(kBFloat16, input.scalar_type(), "_aminmax_all_all", [&] {
```
  • performance update on avx512 machine: Xeon(R) Gold 6248 CPU @ 2.50GHz

```
before: softmax: 128x1024: fp32: 151.457 us; bf16: 362.440 us
after:  softmax: 128x1024: fp32: 151.757 us; bf16: 194.105 us

before: log_softmax: 128x1024: fp32: 157.474 us; bf16: 411.537 us
after:  log_softmax: 128x1024: fp32: 152.229 us; bf16: 163.657 us

before: max: 128x1024: fp32: 23.714 us; bf16: 63.077 us
after:  max: 128x1024: fp32: 24.523 us; bf16: 17.484 us

before: min: 128x1024: fp32: 23.707 us; bf16: 63.198 us
after:  min: 128x1024: fp32: 24.067 us; bf16: 17.498 us

before: _aminmax: 128x1024: fp32: 25.156 us; bf16: 69.929 us
after:  _aminmax: 128x1024: fp32: 25.272 us; bf16: 19.487 us
```

  • performance update on avx2 machine: Xeon(R) CPU E5-2680 v3 @ 2.50GHz

```
before: softmax: 128x1024: fp32: 229.048 us; bf16: 493.946 us
after:  softmax: 128x1024: fp32: 248.060 us; bf16: 293.789 us

before: log_softmax: 128x1024: fp32: 274.467 us; bf16: 673.550 us
after:  log_softmax: 128x1024: fp32: 251.243 us; bf16: 241.080 us

before: max: 128x1024: fp32: 36.288 us; bf16: 88.725 us
after:  max: 128x1024: fp32: 32.229 us; bf16: 23.905 us

before: min: 128x1024: fp32: 32.179 us; bf16: 87.669 us
after:  min: 128x1024: fp32: 30.829 us; bf16: 22.847 us

before: _aminmax: 128x1024: fp32: 33.098 us; bf16: 105.431 us
after:  _aminmax: 128x1024: fp32: 30.785 us; bf16: 28.113 us
```

Notes: With this PR, BFloat16 softmax is still slower than float32 because of the lack of native dtype conversion intrinsics (fp32/bf16 conversion currently uses an emulated method); on Sapphire Rapids, BFloat16 is faster.

mingfeima added a commit to mingfeima/pytorch that referenced this pull request Apr 28, 2021
@mingfeima mingfeima requested a review from VitalyFedyunin May 13, 2021 03:01
@mingfeima
Collaborator Author

@VitalyFedyunin Could you please review this stack? We are trying to optimize CPU BFloat16 path performance.

dgl-intel pushed a commit to dgl-intel/pytorch that referenced this pull request May 14, 2021
@mdschatz
Contributor

@mdschatz has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

imaginary-person added a commit to imaginary-person/pytorch-1 that referenced this pull request May 28, 2021
@VitalyFedyunin
Contributor

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin
Contributor

Hi! Please rebase to adapt to the more general vectorization, as we are in the process of introducing AVX512 support.

@mingfeima
Collaborator Author

@VitalyFedyunin Hi, this stack has been rebased!

Contributor

@VitalyFedyunin VitalyFedyunin left a comment


Some code moves are required; otherwise looks fine.

@mingfeima mingfeima changed the title Optimize some redunction operators on CPU BFloat16 [WIP] Optimize some redunction operators on CPU BFloat16 Jun 17, 2021
@mingfeima mingfeima changed the title [WIP] Optimize some redunction operators on CPU BFloat16 Optimize some reduction operators on CPU BFloat16 Jun 18, 2021
@mingfeima
Collaborator Author

mingfeima commented Jun 18, 2021

@VitalyFedyunin, updated! Also added a map4 test in vec256_test_all_types_XXX since the other PR got merged. Please check.

@VitalyFedyunin
Contributor

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mdschatz
Contributor

mdschatz commented Jun 24, 2021

@mingfeima, I've noticed that including vec256_float.h in vec256_bfloat16.h is needed for compilation to proceed; without it, I get a lot of __m256 type-cast errors. Should that line be included in this PR or not? @jgong5, can you elaborate on this?

I'd defer to @VitalyFedyunin to decide how to properly address this, though.

@facebook-github-bot
Contributor

@VitalyFedyunin merged this pull request in 5a077bb.

@jgong5
Collaborator

jgong5 commented Jun 25, 2021

@mdschatz The compilation error was due to an incorrect header file inclusion in IPEX (it should include ATen/cpu/vec/vec.h instead of vec256_bfloat16.h directly), and we have fixed it there. So, no worries!

@mingfeima
Collaborator Author

> @mdschatz The compilation error was due to an incorrect header file inclusion in IPEX (it should include ATen/cpu/vec/vec.h instead of vec256_bfloat16.h directly), and we have fixed it there. So, no worries!

Right, you should include &lt;ATen/cpu/vec/vec.h&gt; instead of directly including vec256_xxx.h.
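As a minimal sketch of the advice above (assuming a downstream project such as IPEX that consumes ATen's vectorization headers):

```cpp
// In a downstream translation unit, include the umbrella header, which pulls in
// the right arch-specific implementations in the correct order:
#include <ATen/cpu/vec/vec.h>

// Do NOT include arch-specific headers such as vec256_bfloat16.h directly;
// they depend on definitions (e.g. from vec256_float.h) that the umbrella
// header includes first, so including them in isolation can trigger
// __m256 type-cast errors like the ones reported above.
```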
