Optimize some reduction operators on CPU BFloat16 #55202
mingfeima wants to merge 14 commits into gh/mingfeima/16/base
Conversation
[ghstack-poisoned]
💊 CI failures summary and remediations

As of commit f2c070c (more details on the Dr. CI page and at hud.pytorch.org/pr/55202):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI.
This PR aims at enabling or optimizing the following reduction operators on CPU BFloat16: softmax, log_softmax, max, min, and _aminmax.
Note that previously some of these operators were either not enabled or not vectorized for BFloat16. Since we have already specialized all members of `Vec256<scalar_t>` for `scalar_t = BFloat16` in `vec256_bfloat16.h`, the following example runs smoothly on BFloat16:

```cpp
using Vec = Vec256<BFloat16>;
Vec one = Vec(BFloat16(1));
vec256::map([&](Vec x) { return one / (one + x.exp()); }, y_ptr, x_ptr, N);
```

However, the current implementation ends up with 3 pairs of dtype conversions, one each for `.exp()`, `+`, and `/`.
I am going to use this stack to push more BFloat16 CPU optimizations, following the same manner as proposed in this PR.
Since this PR is not related to parallelization, only single-core performance is tested.

Enabling BFloat16 for an operator amounts to extending its dispatch macro, e.g.:

```diff
- AT_DISPATCH_ALL_TYPES(input.scalar_type(), "_aminmax_all_all", [&] {
+ AT_DISPATCH_ALL_TYPES_AND(kBFloat16, input.scalar_type(), "_aminmax_all_all", [&] {
```
Benchmark results (single core, input size 128x1024; two test configurations):

| op | fp32 before | fp32 after | bf16 before | bf16 after |
| --- | --- | --- | --- | --- |
| softmax | 151.457 us | 151.757 us | 362.440 us | 194.105 us |
| log_softmax | 157.474 us | 152.229 us | 411.537 us | 163.657 us |
| max | 23.714 us | 24.523 us | 63.077 us | 17.484 us |
| min | 23.707 us | 24.067 us | 63.198 us | 17.498 us |
| _aminmax | 25.156 us | 25.272 us | 69.929 us | 19.487 us |

| op | fp32 before | fp32 after | bf16 before | bf16 after |
| --- | --- | --- | --- | --- |
| softmax | 229.048 us | 248.060 us | 493.946 us | 293.789 us |
| log_softmax | 274.467 us | 251.243 us | 673.550 us | 241.080 us |
| max | 36.288 us | 32.229 us | 88.725 us | 23.905 us |
| min | 32.179 us | 30.829 us | 87.669 us | 22.847 us |
| _aminmax | 33.098 us | 30.785 us | 105.431 us | 28.113 us |

Note: with this PR, BFloat16 softmax is still slower than float32 due to the lack of native dtype conversion intrinsics (the fp32/bf16 conversion currently uses an emulated method); on Sapphire Rapids, BFloat16 softmax is faster.
ghstack-source-id: 225abdf Pull Request resolved: pytorch#55202
@VitalyFedyunin Could you please review this stack? We are trying to optimize CPU BFloat16 path performance.
ghstack-source-id: 92816b7 Pull Request resolved: pytorch#55202
@mdschatz has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Hi! Please rebase to adapt to the more general vectorization code, as we are in the process of introducing AVX512 support.
Differential Revision: [D28836790](https://our.internmc.facebook.com/intern/diff/D28836790)
@VitalyFedyunin Hi, this stack has been rebased!
VitalyFedyunin left a comment:
Some code moves required, otherwise looks fine.
@VitalyFedyunin, updated! Also added a map4 test in vec256_test_all_types_XXX since the other PR got merged. Please check.
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@mingfeima, I've noticed that including … breaks compilation. I'd defer to @VitalyFedyunin to decide how to properly address it, though.
@VitalyFedyunin merged this pull request in 5a077bb.
@mdschatz The compilation error was due to an incorrect header file inclusion in IPEX (it should include …).
Right, you should include …
Stack from ghstack:
Differential Revision: D28836790