LayerNorm #2112
Conversation
Pinging @jekbradbury to read the discussion in the previous PR and do a sanity check on this one.
torch/nn/modules/instancenorm.py (outdated):

    At evaluation time (`.eval()`), the default behaviour of the LayerNorm
    module stays the same, i.e. the running mean/variance is NOT used for
    normalization. One can force using the stored mean and variance with
    the `.train(False)` method.
I'm not entirely sold on reusing the InstanceNorm code to do LayerNorm. I think Sam's comments are right, and additionally I'm afraid that keeping running statistics (which is inconsistent with the motivation of LayerNorm) would be unnecessary overhead. LayerNorm is already a surprisingly significant fraction of compute time for Transformer-like networks; it might be worth doing a microbenchmark to make sure that it's as fast as we can easily make it.
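For reference, a microbenchmark along these lines would give a rough number for that cost (a minimal sketch using current PyTorch idioms; `naive_layer_norm` is a hypothetical element-wise stand-in, not the implementation in this PR):

```python
import time
import torch

def naive_layer_norm(x, eps=1e-5):
    # Normalise each row over its feature dimension with separate kernels.
    mean = x.mean(dim=1, keepdim=True)
    std = x.std(dim=1, keepdim=True)
    return (x - mean) / (std + eps)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(128, 1024, device=device)

iters = 1000
if device == 'cuda':
    torch.cuda.synchronize()
start = time.time()
for _ in range(iters):
    y = naive_layer_norm(x)
if device == 'cuda':
    torch.cuda.synchronize()
elapsed = time.time() - start
print('naive layer norm: %.3f ms/iter' % (elapsed / iters * 1e3))
```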
Layer norm and instance norm are similar normalisation strategies that are batch-independent and therefore should not require running statistics. As noted above, this was not working. To clarify the difference between the two: layer norm works on 2D inputs, normalising over the 2nd dim, whilst instance norm works on 3/4/5D inputs, normalising over all but the first 2 dims (see the sketch below). Therefore I propose the following for this PR:
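As a sketch of that difference in which dimensions are normalised (plain tensor ops under my reading of the comment above, not the code in this PR):

```python
import torch

# Layer norm: 2D input (N, C), each sample normalised over dim 1.
x2d = torch.randn(8, 16)
ln = (x2d - x2d.mean(dim=1, keepdim=True)) / (x2d.std(dim=1, keepdim=True) + 1e-5)

# Instance norm: e.g. 4D input (N, C, H, W), each (sample, channel) pair
# normalised over the remaining dims, i.e. everything but the first two.
x4d = torch.randn(8, 16, 32, 32)
flat = x4d.view(8, 16, -1)
inorm = (flat - flat.mean(dim=2, keepdim=True)) / (flat.std(dim=2, keepdim=True) + 1e-5)
inorm = inorm.view_as(x4d)
```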
@jekbradbury, if LayerNorm is becoming a computational bottleneck, maybe it makes sense to put it in the backend? It should be pretty similar to BatchNorm.
I must be missing something here, but why are we moving away from using batch norm as a backend? The overhead in batch norm of keeping running statistics should be small (at least as long as batch_size * num_channels >> feature_size), and everything is done in one kernel, which should be a big performance advantage over one kernel to calculate the mean, a handful to calculate the std, and another handful to normalize. Using the cunn backend might be better here than cudnn, since it does not care whether batch_size is 1 (which it will be after reshaping); IIRC, cudnn has noticeably worse perf with batch_size 1.
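To make the reshaping trick concrete, here is a hedged sketch of layer norm routed through batch norm (written against the current `torch.nn.functional.batch_norm` signature rather than the autograd backend discussed here; note the reshaped batch size of 1):

```python
import torch
import torch.nn.functional as F

def layer_norm_via_batch_norm(x, eps=1e-5):
    # x: (N, C). Treat each sample as a "channel" of a batch of size 1, so the
    # per-channel statistics computed by batch norm become per-sample
    # statistics over the feature dimension.
    n, c = x.size()
    reshaped = x.view(1, n, c)
    out = F.batch_norm(reshaped, None, None, training=True, eps=eps)
    return out.view(n, c)

x = torch.randn(4, 10)
y = layer_norm_via_batch_norm(x)
print(y.mean(dim=1), y.var(dim=1, unbiased=False))  # ~0 and ~1 per sample
```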
@ngimel What would you suggest, considering instance norm uses a batch size of 1, and I can also imagine situations where …
https://github.com/pytorch/pytorch/blob/master/torch/nn/functional.py#L560-L563 just links to autograd batchnorm, and there it's not too hard to not use cudnn if batch size == 1: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/functions/batch_normalization.cpp#L55-L58. |
@pytorchbot add to whitelist
Note for when coming back to this: we may want to make LayerNorm the same for {1, 2, 3}D, for the same reasons as outlined in #2628.
I read through the comments cursorily, but I still don't understand why there are two separate PRs.
Closing in favour of #4922.
WIP. See #2019 for details on this.