Conversation

@Kaixhin
Contributor

@Kaixhin Kaixhin commented Jul 15, 2017

WIP. See #2019 for details on this.

@Kaixhin
Contributor Author

Kaixhin commented Jul 15, 2017

Pinging @jekbradbury to read the discussion in the previous PR and do a sanity check on this one.

At evaluation time (`.eval()`), the default behaviour of the LayerNorm
module stays the same, i.e. the running mean/variance is NOT used for
normalization. One can force using the stored mean and variance with
the `.train(False)` method.


@jekbradbury
Contributor

I'm not entirely sold on reusing the InstanceNorm code to do LayerNorm. I think Sam's comments are right, and additionally I'm afraid that keeping running statistics (which is inconsistent with the motivation of LayerNorm) would be unnecessary overhead. LayerNorm is a surprisingly significant fraction of compute time for Transformer-like networks already; it might be worth doing a microbenchmark to make sure that it's as fast as we can easily make it.
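
A rough microbenchmark sketch along those lines (assuming a CUDA device; the shapes, iteration count and the stand-in layer norm below are arbitrary example choices, not this PR's implementation):

```python
import time
import torch

# Time a layer-norm-style normalisation over a Transformer-ish activation shape.
x = torch.randn(64, 512, 1024, device="cuda")  # (batch, seq_len, model_dim)

def simple_layer_norm(x, eps=1e-5):
    # Stand-in normalisation over the last dimension, per position.
    mean = x.mean(-1, keepdim=True)
    std = x.std(-1, keepdim=True)
    return (x - mean) / (std + eps)

for _ in range(10):                 # warm-up
    simple_layer_norm(x)
torch.cuda.synchronize()

start = time.time()
for _ in range(100):
    simple_layer_norm(x)
torch.cuda.synchronize()
print(f"{(time.time() - start) / 100 * 1e3:.3f} ms per call")
```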

@Kaixhin
Contributor Author

Kaixhin commented Jul 17, 2017

Layer norm and instance norm are similar normalisation strategies that are batch-independent and therefore should not require running statistics. As noted above, this was not working for nn.InstanceNorm{1,2,3}d anyway. Hence, we can move away from using batch norm as a backend and, furthermore, make these modules parameterless by default.

To clarify the difference between the two: layer norm works on 2D inputs, normalising over the 2nd dim, whilst instance norm works on 3/4/5D inputs, normalising over all but the first 2 dims.
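
For concreteness, a minimal sketch of which dimensions each strategy reduces over (not code from this PR; the shapes are just example values):

```python
import torch

# Example shapes only, to show the reduction dims of each strategy.
x2d = torch.randn(8, 16)             # layer norm input: (batch, features)
ln = (x2d - x2d.mean(1, keepdim=True)) / (x2d.var(1, keepdim=True) + 1e-5).sqrt()

x4d = torch.randn(8, 3, 32, 32)      # instance norm input: (batch, channels, H, W)
mu = x4d.mean(dim=(2, 3), keepdim=True)   # statistics per sample and per channel
var = x4d.var(dim=(2, 3), keepdim=True)
inorm = (x4d - mu) / (var + 1e-5).sqrt()
```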

I therefore propose the following for this PR:

  • Replace the batch norm backend for instance norm with a separate function (F.instance_norm), which is parameterless by default and hence can work with arbitrary 3+D inputs.
  • Implement layer norm using @jekbradbury's simple code from [Feature Request] Layer Normalization #1959. It seems simpler to do this separately (in F.layer_norm) rather than adapt F.instance_norm to work with fewer than 3 dimensions. (A rough sketch of both functionals follows this list.)
  • Add basic module tests for nn.LayerNorm and nn.InstanceNorm{1,2,3}d.
  • Fix test_LayerNorm.
  • Fix _test_InstanceNorm.
  • Tidy up the normalisation docs to make what happens in layer and instance norm clear.
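
As a rough sketch of what the first two bullets could look like (hypothetical parameterless functionals based on the description above, not the actual code in this PR):

```python
import torch

def layer_norm(input, eps=1e-5):
    # Hypothetical parameterless functional: normalise a 2D input over its
    # feature dimension (dim 1), per sample.
    mean = input.mean(1, keepdim=True)
    std = input.std(1, keepdim=True)
    return (input - mean) / (std + eps)

def instance_norm(input, eps=1e-5):
    # Hypothetical parameterless functional: for a 3+D input (N, C, *spatial),
    # normalise over every dimension except the first two.
    dims = tuple(range(2, input.dim()))
    mean = input.mean(dim=dims, keepdim=True)
    std = input.std(dim=dims, keepdim=True)
    return (input - mean) / (std + eps)
```

Reducing over range(2, input.dim()) is what would let instance_norm accept arbitrary 3+D inputs without going through a batch norm backend.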

@ngimel
Collaborator

ngimel commented Jul 17, 2017

@jekbradbury, if LayerNorm is becoming a computational bottleneck, maybe it makes sense to put it in the backend? It should be pretty similar to batchNorm.

@ngimel
Collaborator

ngimel commented Jul 17, 2017

I must be missing something here, but why are we moving away from using batch norm as a backend? The overhead in batch norm to keep running statistics should be small (at least as long as batch_size * num_channels >> feature_size), but everything is done in 1 kernel, which should be a big performance advantage over 1 kernel to calculate the mean, a handful to calculate the std and another handful to normalize. Using the cunn backend might be better here than cudnn, since it does not care if batch_size is 1 (which it will be after reshaping); IIRC, cudnn will have noticeably worse perf with batch_size 1.
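
For reference, a sketch of the reshaping being referred to (an illustration of how an InstanceNorm-style module can sit on top of batch norm, assuming a contiguous (N, C, *spatial) input; not the backend code itself). Folding the batch into the channel dimension is what makes the effective batch size 1:

```python
import torch
import torch.nn.functional as F

def instance_norm_via_batch_norm(x, eps=1e-5, momentum=0.1):
    # x: (N, C, *spatial). Fold the batch into the channel dimension so that
    # batch norm's per-channel statistics become per-(sample, channel) statistics.
    n, c = x.size(0), x.size(1)
    reshaped = x.contiguous().view(1, n * c, *x.size()[2:])
    # Dummy running buffers of length N*C; in the module version these are kept
    # and updated, which is the bookkeeping overhead under discussion.
    running_mean = x.new_zeros(n * c)
    running_var = x.new_ones(n * c)
    out = F.batch_norm(reshaped, running_mean, running_var,
                       training=True, momentum=momentum, eps=eps)
    return out.view_as(x)
```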

@Kaixhin
Contributor Author

Kaixhin commented Jul 18, 2017

@ngimel What would you suggest, considering instance norm uses a batch size of 1 and batch_norm (sensibly) doesn't provide a way to turn off cudnn?

I can also imagine situations where batch_size * num_channels << feature_size, e.g. in the lower layers of a CNN where you're looking at activation statistics for style transfer. James may also be able to say more about his use cases.

@ngimel
Collaborator

ngimel commented Jul 18, 2017

https://github.com/pytorch/pytorch/blob/master/torch/nn/functional.py#L560-L563 just links to the autograd batch norm, and there it's not too hard to avoid cudnn if batch size == 1: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/functions/batch_normalization.cpp#L55-L58.
Oops, I had << the wrong way: if batch_size * num_channels << feature_size, the overhead is small. The bigger num_channels and the smaller feature_size, the bigger the overhead is (because the size of the running_vars would be batch_size * num_channels), so it is the upper layers of a CNN that would suffer. Idk, the effort one might want to spend on this depends on whether it is a bottleneck or not; if it is, then modifying the cunn backend not to track running vars might be worth it.
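
To put rough numbers on that scaling (example shapes only, not measurements): after the reshaping, each running-statistics buffer has batch_size * num_channels elements, so the relative bookkeeping grows as num_channels goes up and the spatial size goes down.

```python
# Example shapes only: element counts of the running statistics vs. the
# elements actually normalised per (sample, channel).
low = dict(batch=32, channels=64, spatial=112 * 112)    # early conv layer
high = dict(batch=32, channels=512, spatial=7 * 7)      # late conv layer

for name, layer in (("low", low), ("high", high)):
    stats = layer["batch"] * layer["channels"]  # running_mean/var length after reshaping
    print(f"{name}: {stats} running-stat elements vs "
          f"{layer['spatial']} features per (sample, channel)")
# low: 2048 running-stat elements vs 12544 features per (sample, channel)
# high: 16384 running-stat elements vs 49 features per (sample, channel)
```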

@Kaixhin Kaixhin mentioned this pull request Jul 18, 2017
@soumith
Contributor

soumith commented Jul 19, 2017

@pytorchbot add to whitelist

@Kaixhin
Contributor Author

Kaixhin commented Oct 22, 2017

Note for when coming back to this: we may want to make LayerNorm the same for {1, 2, 3}D, for the same reasons as outlined in #2628.

@netheril96

I read the comments cursorily, but I still don't understand why there are two separate PRs.

@Kaixhin
Contributor Author

Kaixhin commented Jan 30, 2018

Closing in favour of #4922.

@Kaixhin Kaixhin closed this Jan 30, 2018
@Kaixhin Kaixhin deleted the ln branch January 30, 2018 05:11
@Kaixhin Kaixhin restored the ln branch January 30, 2018 05:11
IvanYashchuk pushed a commit to IvanYashchuk/pytorch that referenced this pull request Oct 25, 2022