Optimize pytorch layer_norm forward #20345
Conversation
soumith left a comment
This PR now introduces Eigen into PyTorch optimizations, without discussion.
The previous comments about moving the code to native/cpu and using the dispatcher have not been addressed.
Force-pushed: c90fa95 → addfd9d → 0cfecc2
Removed Eigen now. For the reduce case, I found that even using Vec256 is still slower than Eigen, so maybe we should discuss or figure out how to make it better.
aten/src/ATen/native/layer_norm.cpp (Outdated)
Please move LayerNormForwardCPUImpl and all CPU logic into the /cpu subfolder using the DECLARE_DISPATCH logic (see CopyKernel for reference). This will allow the kernel to use AVX2 instructions and the other optimizations of the OSS build.
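For reference, a minimal sketch of the DECLARE_DISPATCH wiring being asked for here is shown below. The file split follows the usual ATen layout, but the function-pointer signature and the names layer_norm_forward_fn, LayerNormKernel, and LayerNormKernelImpl are illustrative assumptions, not code lifted from this PR.

```cpp
// --- aten/src/ATen/native/layer_norm.h (sketch) -------------------------
#include <ATen/ATen.h>
#include <ATen/native/DispatchStub.h>

namespace at {
namespace native {

// Kernel signature is an assumption for illustration.
using layer_norm_forward_fn = void (*)(
    const Tensor& /* X */,
    const Tensor& /* gamma */,
    const Tensor& /* beta */,
    int64_t /* M */,
    int64_t /* N */,
    double /* eps */,
    Tensor* /* Y */,
    Tensor* /* mean */,
    Tensor* /* rstd */);

// Declares the dispatch stub that the generic code calls into.
DECLARE_DISPATCH(layer_norm_forward_fn, LayerNormKernel);

} // namespace native
} // namespace at

// --- aten/src/ATen/native/layer_norm.cpp (sketch) -----------------------
// DEFINE_DISPATCH(LayerNormKernel);
// ...
// LayerNormKernel(kCPU, X, gamma, beta, M, N, eps, &Y, &mean, &rstd);

// --- aten/src/ATen/native/cpu/layer_norm_kernel.cpp (sketch) ------------
// Files under native/cpu are compiled once per CPU capability, so the
// registered kernel picks up AVX2 codegen in the OSS build.
// static void LayerNormKernelImpl(const Tensor& X, const Tensor& gamma,
//                                 const Tensor& beta, int64_t M, int64_t N,
//                                 double eps, Tensor* Y, Tensor* mean,
//                                 Tensor* rstd) { /* vectorized loop */ }
// REGISTER_DISPATCH(LayerNormKernel, &LayerNormKernelImpl);
```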
Done
aten/src/ATen/native/layer_norm.cpp (Outdated)
I'm quite sure you want to call at::native::empty_like here and avoid all the tracing/profiling checks.
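As a hedged illustration of this suggestion, using the single-Tensor overload of at::native::empty_like as it existed around the time of this PR (the native signature has since gained additional optional arguments):

```cpp
#include <ATen/ATen.h>
#include <ATen/NativeFunctions.h>

// Sketch: inside a native CPU kernel, allocate the output by calling the
// native implementation directly rather than the dispatched at::empty_like,
// so the call does not re-enter the dispatcher's tracing/profiling layers.
at::Tensor make_output_like(const at::Tensor& X) {
  return at::native::empty_like(X);
}
```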
VitalyFedyunin left a comment
This PR requires new tests to be written for native_layer_norm
@BIT-silence - Is this a copy of the C2 kernels?
You are exposing the native_layer_norm function at the user-level API, and it seems to require contiguous inputs. Please add tests to cover the API calls as well as the contiguity checks. Alternatively, if the intent is to introduce a fast kernel, implement all the code as a kernel and avoid creating a new native function. It would also be nice to see benchmarks comparing the old and new layer_norm code.
Logically it is, though the C2 kernel is implemented using the Eigen library. I tested the performance difference: for the rowwise moments part, the Eigen version is faster than this version, which relies on the compiler's auto-vectorization, while the elementwise affine part performs the same. I also tried using Vec256 for the rowwise moments, but it was actually a little slower than the for-loop with auto-vectorization.
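For context, the "for-loop with auto-vectorization" variant mentioned here has roughly the following shape. This is a standalone sketch under assumed names and a row-major M x N float layout; it is not the kernel from the PR.

```cpp
#include <cmath>
#include <cstdint>

// Rowwise mean and reciprocal standard deviation over a row-major M x N
// float buffer, written as a plain inner loop so the compiler is free to
// auto-vectorize the reduction.
void RowwiseMoments(int64_t M, int64_t N, const float* X, float eps,
                    float* mean, float* rstd) {
  for (int64_t i = 0; i < M; ++i) {
    const float* row = X + i * N;
    float sum = 0.0f;
    float sum_sq = 0.0f;
    for (int64_t j = 0; j < N; ++j) {
      sum += row[j];
      sum_sq += row[j] * row[j];
    }
    const float m = sum / static_cast<float>(N);
    const float var = sum_sq / static_cast<float>(N) - m * m;
    mean[i] = m;
    rstd[i] = 1.0f / std::sqrt(var + eps);
  }
}
```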
Force-pushed: 0cfecc2 → a17b02d
Thanks for the advice. I have removed the native_layer_norm function. Later I will add the backward part for layer_norm and then route autograd through layer_norm instead of batch_norm. Some benchmark results for this change:
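The benchmark table itself is not reproduced here. As a rough illustration, a standalone harness along these lines could be used to collect such numbers; the shapes, iteration counts, and epsilon are arbitrary choices, not the settings used in the thread.

```cpp
#include <ATen/ATen.h>
#include <chrono>
#include <iostream>

int main() {
  // Float, contiguous CPU input: the case the fast path targets.
  at::Tensor x = at::randn({128, 1024});
  at::Tensor gamma = at::ones({1024});
  at::Tensor beta = at::zeros({1024});
  const double eps = 1e-5;
  const int warmup = 10;
  const int iters = 1000;

  // Warm-up to avoid measuring one-time allocation and dispatch costs.
  for (int i = 0; i < warmup; ++i) {
    at::layer_norm(x, {1024}, gamma, beta, eps);
  }

  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    at::layer_norm(x, {1024}, gamma, beta, eps);
  }
  auto t1 = std::chrono::steady_clock::now();

  const double us =
      std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
  std::cout << "layer_norm (128x1024, float, CPU): " << us << " us/iter"
            << std::endl;
  return 0;
}
```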
Force-pushed: a17b02d → f5f450f
soumith left a comment
Thanks for making the changes. From my side this is good to go.
Actually, I have removed native_layer_norm. Currently we don't expose any new functions; we only add a fast path for layer_norm. Is that fine?
We should ideally test that the fast path gives the same results as the slow path, but maybe this is implicitly tested by the CUDA tests.
Actually, not only the CUDA tests but also the JIT tests cover that.
I will add the grad part for this fast path, and then all layer_norm calls on CPU should go through this path. Since the current fast path is already exercised by the existing tests, I'm wondering whether there is any need to add a specific test for it.
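A minimal parity check of the kind discussed above might look like the following sketch: it runs at::layer_norm on a float, contiguous CPU tensor (which should take the new fast path) and compares it against a reference computed with plain tensor ops. The shapes and tolerances are assumptions, not values from the PR's test suite.

```cpp
#include <ATen/ATen.h>
#include <iostream>

int main() {
  at::Tensor x = at::randn({8, 64});
  at::Tensor gamma = at::randn({64});
  at::Tensor beta = at::randn({64});
  const double eps = 1e-5;

  // Fast-path candidate: float, contiguous, CPU.
  at::Tensor y_fast = at::layer_norm(x, {64}, gamma, beta, eps);

  // Reference: normalize over the last dimension with basic tensor ops,
  // using the biased variance that layer norm is defined with.
  at::Tensor mean = x.mean({-1}, /*keepdim=*/true);
  at::Tensor xc = x - mean;
  at::Tensor var = (xc * xc).mean({-1}, /*keepdim=*/true);
  at::Tensor y_ref = xc / (var + eps).sqrt() * gamma + beta;

  std::cout << "allclose: " << at::allclose(y_fast, y_ref, 1e-4, 1e-5)
            << ", max abs diff: "
            << (y_fast - y_ref).abs().max().item<float>() << std::endl;
  return 0;
}
```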
Force-pushed: f5f450f → efb7bac → 0351358 → 855256f → c764776 → d97a9e7 → f408af7
Force-pushed: f408af7 → b075610
This pull request has been merged in c9da011.
Summary:
Pull Request resolved: pytorch/pytorch#20345
Separate from D15194600.
Optimize the pytorch layer_norm op, part 1: optimize layer_norm_forward_cpu; import Eigen Maps for the performance of the reduction.

Reviewed By: zheng-xq
Differential Revision: D15290608
fbshipit-source-id: cf2c208dfd6fbcbc4c69db3ed60278d9bee156b5
Summary:
Adds a quantized implementation of LayerNorm for server.

Relevant PRs:
* #20345 (floating point LN)
* #33080 (quantized BN)

A future PR will add the Python wrapper.

Test Plan: numerics match the floating point implementation. TODO: benchmarks.

ghstack-source-id: 3c3721f
Pull Request resolved: #35329