Conversation

@Kaixhin
Contributor

@Kaixhin Kaixhin commented Jul 15, 2017

WIP. See #2019 for details on this.

@Kaixhin
Contributor Author

Kaixhin commented Jul 15, 2017

Pinging @jekbradbury to read the discussion in the previous PR and do a sanity check on this one.

At evaluation time (`.eval()`), the default behaviour of the LayerNorm
module stays the same, i.e. the running mean/variance is NOT used for
normalization. One can force using the stored mean and variance with
the `.train(False)` method.


@jekbradbury
Contributor

I'm not entirely sold on reusing the InstanceNorm code to do LayerNorm. I think Sam's comments are right, and additionally I'm afraid that keeping running statistics (which is inconsistent with the motivation of LayerNorm) would be unnecessary overhead. LayerNorm is a surprisingly significant fraction of compute time for Transformer-like networks already; it might be worth doing a microbenchmark to make sure that it's as fast as we can easily make it.
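
A rough microbenchmark sketch along those lines (assuming a CUDA device; the shapes, iteration count and the stand-in layer norm below are arbitrary example choices, not this PR's implementation):

```python
import time
import torch

# Time a layer-norm-style normalisation over a Transformer-ish activation shape.
x = torch.randn(64, 512, 1024, device="cuda")  # (batch, seq_len, model_dim)

def simple_layer_norm(x, eps=1e-5):
    # Stand-in normalisation over the last dimension, per position.
    mean = x.mean(-1, keepdim=True)
    std = x.std(-1, keepdim=True)
    return (x - mean) / (std + eps)

for _ in range(10):                 # warm-up
    simple_layer_norm(x)
torch.cuda.synchronize()

start = time.time()
for _ in range(100):
    simple_layer_norm(x)
torch.cuda.synchronize()
print(f"{(time.time() - start) / 100 * 1e3:.3f} ms per call")
```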

@Kaixhin
Contributor Author

Kaixhin commented Jul 17, 2017

Layer norm and instance norm are similar normalisation strategies that are batch-independent and therefore should not require running statistics. As noted above, this was not working for nn.InstanceNorm{1,2,3}d anyway. Hence, we can move away from using batch norm as a backend and, furthermore, make these modules parameterless by default.

To clarify the difference between the two: layer norm works on 2D inputs, normalising over the 2nd dim, whilst instance norm works on 3/4/5D inputs, normalising over all but the first 2 dims.
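
For concreteness, a minimal sketch of which dimensions each strategy reduces over (not code from this PR; the shapes are just example values):

```python
import torch

# Example shapes only, to show the reduction dims of each strategy.
x2d = torch.randn(8, 16)             # layer norm input: (batch, features)
ln = (x2d - x2d.mean(1, keepdim=True)) / (x2d.var(1, keepdim=True) + 1e-5).sqrt()

x4d = torch.randn(8, 3, 32, 32)      # instance norm input: (batch, channels, H, W)
mu = x4d.mean(dim=(2, 3), keepdim=True)   # statistics per sample and per channel
var = x4d.var(dim=(2, 3), keepdim=True)
inorm = (x4d - mu) / (var + 1e-5).sqrt()
```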

I therefore propose the following for this PR:

  • Replace the batch norm backend for instance norm with a separate function (F.instance_norm), which is parameterless by default and hence can work with arbitrary 3+D inputs.
  • Implement layer norm using @jekbradbury's simple code from [Feature Request] Layer Normalization #1959. It seems simpler to do this separately (in F.layer_norm) rather than adapt F.instance_norm to work with fewer than 3 dimensions. (A rough sketch of both functionals follows this list.)
  • Add basic module tests for nn.LayerNorm and nn.InstanceNorm{1,2,3}d.
  • Fix test_LayerNorm.
  • Fix _test_InstanceNorm.
  • Tidy up the normalisation docs to make what happens in layer and instance norm clear.
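
As a rough sketch of what the first two bullets could look like (hypothetical parameterless functionals based on the description above, not the actual code in this PR):

```python
import torch

def layer_norm(input, eps=1e-5):
    # Hypothetical parameterless functional: normalise a 2D input over its
    # feature dimension (dim 1), per sample.
    mean = input.mean(1, keepdim=True)
    std = input.std(1, keepdim=True)
    return (input - mean) / (std + eps)

def instance_norm(input, eps=1e-5):
    # Hypothetical parameterless functional: for a 3+D input (N, C, *spatial),
    # normalise over every dimension except the first two.
    dims = tuple(range(2, input.dim()))
    mean = input.mean(dim=dims, keepdim=True)
    std = input.std(dim=dims, keepdim=True)
    return (input - mean) / (std + eps)
```

Reducing over range(2, input.dim()) is what would let instance_norm accept arbitrary 3+D inputs without going through a batch norm backend.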

@ngimel
Collaborator

ngimel commented Jul 17, 2017

@jekbradbury, if LayerNorm is becoming a computational bottleneck, maybe it makes sense to put it in the backend? It should be pretty similar to batchNorm.

@ngimel
Collaborator

ngimel commented Jul 17, 2017

I must be missing something here, but why are we moving away from using batch norm as a backend? The overhead in batch norm to keep running statistics should be small (at least as long as batch_size * num_channels >> feature_size), but everything is done in 1 kernel, which should be a big performance advantage over 1 kernel to calculate the mean, a handful to calculate the std and another handful to normalize. Using the cunn backend might be better here than cudnn, since it does not care if batch_size is 1 (which it will be after reshaping); IIRC, cudnn will have noticeably worse perf with batch_size 1.
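
For reference, a sketch of the reshaping being referred to (an illustration of how an InstanceNorm-style module can sit on top of batch norm, assuming a contiguous (N, C, *spatial) input; not the backend code itself). Folding the batch into the channel dimension is what makes the effective batch size 1:

```python
import torch
import torch.nn.functional as F

def instance_norm_via_batch_norm(x, eps=1e-5, momentum=0.1):
    # x: (N, C, *spatial). Fold the batch into the channel dimension so that
    # batch norm's per-channel statistics become per-(sample, channel) statistics.
    n, c = x.size(0), x.size(1)
    reshaped = x.contiguous().view(1, n * c, *x.size()[2:])
    # Dummy running buffers of length N*C; in the module version these are kept
    # and updated, which is the bookkeeping overhead under discussion.
    running_mean = x.new_zeros(n * c)
    running_var = x.new_ones(n * c)
    out = F.batch_norm(reshaped, running_mean, running_var,
                       training=True, momentum=momentum, eps=eps)
    return out.view_as(x)
```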

@Kaixhin
Contributor Author

Kaixhin commented Jul 18, 2017

@ngimel What would you suggest, considering instance norm uses a batch size of 1 and batch_norm (sensibly) doesn't provide a way to turn off cudnn?

I can also imagine situations where batch_size * num_channels << feature_size, e.g. in the lower layers of a CNN where you're looking at activation statistics for style transfer. James may also be able to say more about his use cases.

@ngimel
Collaborator

ngimel commented Jul 18, 2017

https://github.com/pytorch/pytorch/blob/master/torch/nn/functional.py#L560-L563 just links to the autograd batch norm, and there it's not too hard to avoid cudnn if batch size == 1: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/functions/batch_normalization.cpp#L55-L58.
Oops, I had << the wrong way: if batch_size * num_channels << feature_size, the overhead is small. The bigger num_channels and the smaller feature_size, the bigger the overhead is (because the size of the running_vars would be batch_size * num_channels), so it is the upper layers of a CNN that would suffer. Idk, the effort one might want to spend on this depends on whether it is a bottleneck or not; if it is, then modifying the cunn backend not to track running vars might be worth it.
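
To put rough numbers on that scaling (example shapes only, not measurements): after the reshaping, each running-statistics buffer has batch_size * num_channels elements, so the relative bookkeeping grows as num_channels goes up and the spatial size goes down.

```python
# Example shapes only: element counts of the running statistics vs. the
# elements actually normalised per (sample, channel).
low = dict(batch=32, channels=64, spatial=112 * 112)    # early conv layer
high = dict(batch=32, channels=512, spatial=7 * 7)      # late conv layer

for name, layer in (("low", low), ("high", high)):
    stats = layer["batch"] * layer["channels"]  # running_mean/var length after reshaping
    print(f"{name}: {stats} running-stat elements vs "
          f"{layer['spatial']} features per (sample, channel)")
# low: 2048 running-stat elements vs 12544 features per (sample, channel)
# high: 16384 running-stat elements vs 49 features per (sample, channel)
```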

@Kaixhin Kaixhin mentioned this pull request Jul 18, 2017
@soumith
Contributor

soumith commented Jul 19, 2017

@pytorchbot add to whitelist

@Kaixhin
Contributor Author

Kaixhin commented Oct 22, 2017

Note for when coming back to this: we may want to make LayerNorm the same for {1, 2, 3}D, for the same reasons as outlined in #2628.

@netheril96

I read the comments cursorily, but I still don't understand why there are two separate PRs.

@Kaixhin
Contributor Author

Kaixhin commented Jan 30, 2018

Closing in favour of #4922.

@Kaixhin Kaixhin closed this Jan 30, 2018
@Kaixhin Kaixhin deleted the ln branch January 30, 2018 05:11
@Kaixhin Kaixhin restored the ln branch January 30, 2018 05:11
IvanYashchuk pushed a commit to IvanYashchuk/pytorch that referenced this pull request Oct 25, 2022