
Conversation

@xiaomengy
Contributor

Summary:
Separate from D15194600
Optimize PyTorch layer_norm op, part 1:
optimize layer_norm_forward_cpu
import Eigen Maps to improve the performance of reduction

Differential Revision: D15290608

Contributor

@soumith left a comment

This PR now introduces Eigen into PyTorch optimizations -- without discussion.
The previous comments about moving to native/cpu and using the dispatcher have not been addressed.

@xiaomengy requested a review from soumith May 15, 2019 17:23
@xiaomengy
Contributor Author

Removed Eigen now. For the reduction case, I found that even using Vec256 is still slower than Eigen, so maybe we can discuss or figure out how to make it better.

Contributor

Please move LayerNormForwardCPUImpl and all the CPU logic into the /cpu subfolder using the DECLARE_DISPATCH logic (see CopyKernel for reference). It will allow the OSS build to utilize AVX2 instructions and other optimizations.
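
For readers unfamiliar with that pattern, here is a minimal sketch of the DECLARE_DISPATCH / REGISTER_DISPATCH wiring being suggested. The stub and kernel names (LayerNormKernel, LayerNormKernelImpl) and the function signature are illustrative assumptions, not the exact ones from this PR; CopyKernel is the real reference.

```cpp
// Illustrative sketch only -- names, paths, and signature are hypothetical.

// aten/src/ATen/native/layer_norm.h
#include <ATen/ATen.h>
#include <ATen/native/DispatchStub.h>

namespace at { namespace native {

using layer_norm_forward_fn = void (*)(
    const Tensor& /*X*/, const Tensor& /*gamma*/, const Tensor& /*beta*/,
    int64_t /*M*/, int64_t /*N*/, double /*eps*/,
    Tensor* /*Y*/, Tensor* /*mean*/, Tensor* /*rstd*/);

DECLARE_DISPATCH(layer_norm_forward_fn, LayerNormKernel);

}} // namespace at::native

// aten/src/ATen/native/layer_norm.cpp (device-independent entry point):
//   DEFINE_DISPATCH(LayerNormKernel);
//   ...
//   LayerNormKernel(kCPU, X, gamma, beta, M, N, eps, &Y, &mean, &rstd);

// aten/src/ATen/native/cpu/layer_norm_kernel.cpp, compiled once per CPU
// capability so AVX2 code paths become available in the OSS build:
//   static void LayerNormKernelImpl(const Tensor& X, ...) { /* vectorized */ }
//   REGISTER_DISPATCH(LayerNormKernel, &LayerNormKernelImpl);
```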

Contributor Author

Done

Contributor

I'm quite sure you want to call at::native::empty_like here and avoid all the tracing/profiling checks.
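
As a hedged illustration of that suggestion (assuming the single-argument overload of at::native::empty_like available at the time of this PR), allocating the output via the native function skips the dispatcher and its tracing/profiling hooks:

```cpp
#include <ATen/ATen.h>
#include <ATen/NativeFunctions.h>

// Illustrative helper, not code from the PR: allocate the output buffer for a
// CPU kernel. at::empty_like(X) would go back through the dispatcher
// (tracing/profiling included); the direct native call does not.
at::Tensor AllocateOutput(const at::Tensor& X) {
  return at::native::empty_like(X);
}
```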

Contributor

@VitalyFedyunin left a comment

This PR requires new tests to be written for native_layer_norm

@cpuhrsch
Contributor

@BIT-silence - Is this a copy of the C2 kernels?

@VitalyFedyunin
Contributor

You are exposing the native_layer_norm function to the user-level API, which seems to require contiguous inputs. Please add tests to cover the API calls as well as contiguity checks. Alternatively, if the intent is to introduce a fast kernel, implement all the code as a kernel and avoid creating a new native function.

Also, it would be nice to see benchmarks that compare the old and new layer_norm code.

@xiaomengy
Contributor Author

@BIT-silence - Is this a copy of the C2 kernels?

Logically it is, although the C2 kernel is implemented using the Eigen library. I tested the performance difference: for the rowwise-moments part, the Eigen version is faster than this version, which relies on the compiler's auto-vectorization; the elementwise-affine part performs the same. I also tried using Vec256 for the rowwise-moments part, but it was actually a little slower than the for-loop with auto-vectorization.
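
To make that comparison concrete, here is a standalone sketch (not code from this PR) of the two rowwise-moments variants being discussed: a plain loop that leans on the compiler's auto-vectorization, and an Eigen::Map version in the style of the Caffe2 kernel. The function names and the M x N row-major layout are assumptions for illustration.

```cpp
#include <Eigen/Core>
#include <cstdint>

// Rowwise mean/variance over an M x N row-major matrix X,
// using var = E[x^2] - E[x]^2.

// Variant 1: plain loop, relying on the compiler to auto-vectorize.
void RowwiseMomentsLoop(int64_t M, int64_t N, const float* X,
                        float* mean, float* var) {
  const float scale = 1.0f / static_cast<float>(N);
  for (int64_t i = 0; i < M; ++i) {
    const float* row = X + i * N;
    float sum = 0.0f;
    float sum_sq = 0.0f;
    for (int64_t j = 0; j < N; ++j) {
      sum += row[j];
      sum_sq += row[j] * row[j];
    }
    mean[i] = sum * scale;
    var[i] = sum_sq * scale - mean[i] * mean[i];
  }
}

// Variant 2: Eigen Maps over each row, in the style of the Caffe2 kernel.
void RowwiseMomentsEigen(int64_t M, int64_t N, const float* X,
                         float* mean, float* var) {
  for (int64_t i = 0; i < M; ++i) {
    Eigen::Map<const Eigen::ArrayXf> row(X + i * N, N);
    mean[i] = row.mean();
    var[i] = row.square().mean() - mean[i] * mean[i];
  }
}
```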

@pytorchbot added the module: cpu label May 16, 2019
@xiaomengy
Contributor Author

Thanks for the advice. I have removed the native_layer_norm function. Later I will add the backward part for layer_norm and then do autograd for layer_norm instead of batch_norm.

Some benchmark results for this change:
with input shape = [64, 128, 56, 56] and normalized_shape = [128, 56, 56], elementwise_affine=True,
the forward time on a devvm went from 350ms to 87.6ms. With this approach, the Hugging Face BERT model forward pass gets about 12% faster.
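
For reference, a minimal sketch of how such a measurement can be reproduced through the ATen C++ API; the warm-up, iteration count, and eps below are arbitrary choices for illustration, not the exact benchmark behind the numbers above.

```cpp
#include <ATen/ATen.h>
#include <chrono>
#include <cstdio>

int main() {
  at::Tensor X = at::randn({64, 128, 56, 56});
  at::Tensor gamma = at::ones({128, 56, 56});
  at::Tensor beta = at::zeros({128, 56, 56});

  // Warm-up so allocator and thread-pool start-up are not measured.
  for (int i = 0; i < 3; ++i) {
    at::Tensor warm = at::layer_norm(X, {128, 56, 56}, gamma, beta, 1e-5);
  }

  constexpr int kIters = 10;
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    at::Tensor Y = at::layer_norm(X, {128, 56, 56}, gamma, beta, 1e-5);
  }
  const auto end = std::chrono::steady_clock::now();
  const double ms =
      std::chrono::duration<double, std::milli>(end - start).count() / kIters;
  std::printf("layer_norm forward: %.1f ms / iter\n", ms);
  return 0;
}
```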

Contributor

@soumith left a comment

Thanks for making the changes. From my side this is good to go.

@xiaomengy
Contributor Author

This PR requires new tests to be written for native_layer_norm

Actually I have removed native_layer_norm. Currently we don't expose any new functions; we just add a fast path for layer_norm. Is that fine?

@fmassa
Member

fmassa commented May 16, 2019

We should ideally test that the fast path gives the same results as the slow path. But maybe this is implicitly tested in the cuda tests

@xiaomengy
Contributor Author

We should ideally test that the fast path gives the same results as the slow path. But maybe this is implicitly tested in the cuda tests

Actually, not only the cuda tests but also the jit tests cover that.

@xiaomengy
Contributor Author

I will add the grad part for this fast path, and then all layer_norm calls on CPU should go through this path. Since the current fast path is already covered by existing tests, I'm wondering if there is any need to add a specific test for it.
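
If a dedicated test is added later, a consistency check along these lines would exercise the fast path directly: compute layer_norm through the ATen API and compare it against a straightforward reference built from reductions. The shapes, eps, and tolerances below are illustrative assumptions, not the project's test conventions.

```cpp
#include <ATen/ATen.h>

// Sketch of a fast-path vs. reference consistency check (illustrative only).
bool LayerNormMatchesReference() {
  const double eps = 1e-5;
  at::Tensor X = at::randn({8, 16, 7, 7});
  at::Tensor gamma = at::randn({16, 7, 7});
  at::Tensor beta = at::randn({16, 7, 7});

  // Fast path under test.
  at::Tensor Y = at::layer_norm(X, {16, 7, 7}, gamma, beta, eps);

  // Reference: normalize each row of the flattened [8, 16*7*7] view.
  at::Tensor X2 = X.reshape({8, -1});
  at::Tensor mean = X2.mean({1}, /*keepdim=*/true);
  at::Tensor var = X2.var({1}, /*unbiased=*/false, /*keepdim=*/true);
  at::Tensor ref =
      ((X2 - mean) / at::sqrt(var + eps)).reshape(X.sizes()) * gamma + beta;

  return at::allclose(Y, ref, /*rtol=*/1e-4, /*atol=*/1e-4);
}
```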

Summary:
Pull Request resolved: pytorch#20345

Separate from D15194600
Optimize PyTorch layer_norm op, part 1:
  optimize layer_norm_forward_cpu
  import Eigen Maps to improve the performance of reduction

Reviewed By: zheng-xq

Differential Revision: D15290608

fbshipit-source-id: d5589f67c515644403ff3ad11006ec43bab18809
@facebook-github-bot
Contributor

This pull request has been merged in c9da011.

zdevito pushed a commit to zdevito/ATen that referenced this pull request May 22, 2019
Summary:
Pull Request resolved: pytorch/pytorch#20345

Separate from D15194600
Optimize PyTorch layer_norm op, part 1:
  optimize layer_norm_forward_cpu
  import Eigen Maps to improve the performance of reduction

Reviewed By: zheng-xq

Differential Revision: D15290608

fbshipit-source-id: cf2c208dfd6fbcbc4c69db3ed60278d9bee156b5
vkuzo added a commit that referenced this pull request Mar 24, 2020
Summary:

Adds a quantized implementation of LayerNorm for server.

Relevant PRs:
* #20345 (floating point LN)
* #33080 (quantized BN)

A future PR will add the Python wrapper.

Test Plan:

numerics match the floating point implementation
TODO: benchmarks

ghstack-source-id: 3c3721f
Pull Request resolved: #35329