
Conversation

jianyuh (Member) commented Nov 4, 2019

Stack from ghstack:

We would like to provide a vectorized implementation for layer norm. This PR reuses #23349.

Single Core:
(Note that our benchmark generates batch_size=47 for the first case and batch_size=56 for the second case. Even so, the vectorized version is still faster than the original non-vectorized reference implementation.)

- Before the PR:
```
native_layer_norm        0.81%            5.884ms          0.81%            5.884ms          122.580us        NaN              0.000us          0.000us          48               [[47, 1, 1024], [1024], [1024]]
```

- After the PR:
```
native_layer_norm        0.68%            5.053ms          0.68%            5.053ms          105.272us        NaN              0.000us          0.000us          48               [[56, 1, 1024], [1024], [1024]]
```

20 Cores:

- Before the PR:
```
native_layer_norm        1.65%            41.682ms         1.65%            41.682ms         868.365us        NaN              0.000us          0.000us          48               [[61, 64, 1024], [1024], [1024]]
```

- After the PR:
```
native_layer_norm        1.34%            33.829ms         1.34%            33.829ms         704.771us        NaN              0.000us          0.000us          48               [[61, 64, 1024], [1024], [1024]]
```
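For context, `native_layer_norm` normalizes each row over the last dimension using a per-row mean and variance, then applies the affine weight and bias. A minimal NumPy sketch of that computation, using the `[[47, 1, 1024], [1024], [1024]]` shapes from the benchmark rows above (this illustrates the math only, not the PR's C++ kernel):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the last dimension; gamma/beta match its size (1024).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

x = np.random.randn(47, 1, 1024).astype(np.float32)
gamma = np.ones(1024, dtype=np.float32)
beta = np.zeros(1024, dtype=np.float32)
y = layer_norm(x, gamma, beta)
# Each output row has approximately zero mean and unit variance.
```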

Differential Revision: D18293522

jianyuh added a commit that referenced this pull request Nov 4, 2019
…ec256

We would like to provide the vectorized implementation for layer norm. This PR reuses #23349.

Differential Revision: [D18293522](https://our.internmc.facebook.com/intern/diff/D18293522/)

ghstack-source-id: 93164082
Pull Request resolved: #29104
@jianyuh jianyuh requested a review from xiaomengy November 4, 2019 01:12
jianyuh added a commit that referenced this pull request Nov 10, 2019
…ec256

Pull Request resolved: #29104

We would like to provide the vectorized implementation for layer norm. This PR reuses #23349.

Differential Revision: [D18293522](https://our.internmc.facebook.com/intern/diff/D18293522/)
ghstack-source-id: 93608529
jianyuh (Member, Author) commented Nov 10, 2019

I can reproduce it on a Skylake machine before my PR with

python run_test.py -i nn -- TestNN.test_LayerNorm_1d_no_elementwise_affine_eval

Output:

```
$ python run_test.py -i nn -- TestNN.test_LayerNorm_1d_no_elementwise_affine_eval
which: no nvcc in (/root/miniconda/bin:/root/miniconda/condabin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/facebook/ops/scripts:/usr/facebook/scripts:/opt/local/bin:/usr/facebook/scripts:/usr/facebook/scripts/db:/root/bin)
Test executor: ['/root/miniconda/bin/python']
Running test_nn ... [2019-11-09 21:25:55.865797]
F
======================================================================
FAIL: test_LayerNorm_1d_no_elementwise_affine_eval (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_nn.py", line 7344, in <lambda>
    add(test_name, lambda self, test=test: test(self))
  File "/root/jhuang_test/pytorch/test/common_nn.py", line 3511, in __call__
    self.test_noncontig(test_case, module, input)
  File "/root/jhuang_test/pytorch/test/common_nn.py", line 3570, in test_noncontig
    test_case.assertEqual(out, output)
  File "/root/jhuang_test/pytorch/test/common_utils.py", line 736, in assertEqual
    assertTensorsEqual(x, y)
  File "/root/jhuang_test/pytorch/test/common_utils.py", line 706, in assertTensorsEqual
    self.assertLessEqual(max_err, prec, message)
AssertionError: tensor(76.1324, grad_fn=<MaxBackward1>) not less than or equal to 1e-05 :

----------------------------------------------------------------------
Ran 1 test in 0.009s

FAILED (failures=1)
Traceback (most recent call last):
  File "run_test.py", line 455, in <module>
    main()
  File "run_test.py", line 447, in main
    raise RuntimeError(message)
RuntimeError: test_nn failed!
```
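The failure comes from `test_noncontig`, which feeds the module a strided, non-contiguous view of the input. A small NumPy illustration of the property a vectorized kernel has to handle (names here are illustrative, not from the PR):

```python
import numpy as np

# A transposed view shares the original buffer but is not C-contiguous:
# its rows are not stride-1 runs in memory, so a SIMD kernel cannot
# simply load consecutive lanes from them.
a = np.arange(12, dtype=np.float32).reshape(3, 4)
t = a.T

print(a.flags['C_CONTIGUOUS'])  # True
print(t.flags['C_CONTIGUOUS'])  # False

# One safe strategy is to materialize a contiguous copy before the
# vectorized pass, at the cost of an extra copy.
t_contig = np.ascontiguousarray(t)
print(t_contig.flags['C_CONTIGUOUS'])  # True
```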

jianyuh added a commit that referenced this pull request Nov 10, 2019
…ec256

Pull Request resolved: #29104

We would like to provide the vectorized implementation for layer norm. This PR reuses #23349.

Differential Revision: [D18293522](https://our.internmc.facebook.com/intern/diff/D18293522/)
ghstack-source-id: 93611515
jianyuh (Member, Author) commented Nov 10, 2019

Fixed the issue by reusing vec256::reduce_all and vec256::map_reduce_all. There might be some overhead, though, since the two loops are no longer fused.
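The two helpers split the statistics computation into separate passes: one reduction for the mean, then a map-plus-reduction for the variance. A rough scalar sketch of that two-pass structure in Python (the names mirror the `vec256` helpers, but this is an illustration under those assumptions, not the PR's C++):

```python
import math

def reduce_all(op, xs):
    # Analogue of vec256::reduce_all: fold the whole row with one operator.
    acc = xs[0]
    for v in xs[1:]:
        acc = op(acc, v)
    return acc

def map_reduce_all(map_op, red_op, xs):
    # Analogue of vec256::map_reduce_all: apply map_op elementwise, then reduce.
    return reduce_all(red_op, [map_op(v) for v in xs])

def row_stats(xs, eps=1e-5):
    n = len(xs)
    # Pass 1: mean via a plain sum reduction.
    mean = reduce_all(lambda a, b: a + b, xs) / n
    # Pass 2: variance via map (squared deviation) + reduce (sum).
    # The two passes over the row are no longer fused into one loop,
    # which is the overhead mentioned above.
    var = map_reduce_all(lambda v: (v - mean) ** 2,
                         lambda a, b: a + b, xs) / n
    rstd = 1.0 / math.sqrt(var + eps)
    return mean, rstd

mean, rstd = row_stats([1.0, 2.0, 3.0, 4.0])
print(mean)  # 2.5
```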

@jianyuh jianyuh requested a review from xiaomengy November 10, 2019 09:23
jamesr66a (Collaborator) left a comment

LGTM. Does this improve some benchmarks?

@jianyuh jianyuh dismissed xiaomengy’s stale review November 21, 2019 05:45

Will check the performance.

jianyuh (Member, Author) commented Nov 21, 2019

> LGTM. Does this improve some benchmarks?

Will check the performance before landing. Thanks!

jianyuh (Member, Author) commented Dec 11, 2019

Updated the performance numbers in the summary.

jianyuh added a commit that referenced this pull request Dec 11, 2019
…ec256

Pull Request resolved: #29104

We would like to provide the vectorized implementation for layer norm. This PR reuses #23349.

Differential Revision: [D18293522](https://our.internmc.facebook.com/intern/diff/D18293522/)
ghstack-source-id: 95345939
facebook-github-bot (Contributor)

This pull request has been merged in d6d6075.

@facebook-github-bot facebook-github-bot deleted the gh/jianyuh/44/head branch December 14, 2019 15:15
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
…29104)

Summary:
Pull Request resolved: pytorch#29104

We would like to provide the vectorized implementation for layer norm. This PR reuses pytorch#23349.

Test Plan:
buck test mode/dev-nosan //caffe2/test:nn -- "LayerNorm"

buck test mode/dev-nosan //caffe2/test:nn -- "test_LayerNorm_1d_no_elementwise_affine_eval"

 python run_test.py -i nn -- TestNN.test_LayerNorm_1d_no_elementwise_affine_eval

Differential Revision: D18293522

fbshipit-source-id: f4cfed6e62bac1b43ee00c32b495ecc836bd9ec5