
Conversation

@xiaomengy
Contributor

Summary:
Add gelu activation forward on CPU in pytorch

Compared to the current Python-implemented version of gelu in BERT models, like

def gelu(self, x):
    return x * 0.5 * (1.0 + torch.erf(x / self.sqrt_two))

The torch.gelu function can reduce the forward time from 333ms to 112ms (with MKL) / 133ms (without MKL) for input size = [64, 128, 56, 56] on a devvm.
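
For context, a minimal sketch of how such a comparison could be timed (hypothetical benchmark code, not the harness used for the numbers above; assumes a build where torch.nn.functional.gelu is available):

import math
import timeit
import torch
import torch.nn.functional as F

def gelu_py(x):
    # eager Python reference, as in the BERT snippet above
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

x = torch.rand(64, 128, 56, 56)

print("python erf:", timeit.timeit(lambda: gelu_py(x), number=10))
print("F.gelu:", timeit.timeit(lambda: F.gelu(x), number=10))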

Differential Revision: D15400974

@pytorchbot pytorchbot added module: cpu CPU specific problem (e.g., perf, algorithm) module: operators labels May 17, 2019
@soumith soumith requested a review from gchanan May 18, 2019 22:31
Contributor

@soumith soumith left a comment

looks pretty good!

You need to add documentation / stub function in functional.py
For example, see GLU : https://github.com/pytorch/pytorch/blob/master/torch/nn/functional.py#L954-L976

Contributor

@soumith soumith May 18, 2019

Contributor Author

Thanks for pointing this out.

Contributor

Once you make the changes above, gelu will be available as torch.nn.functional.gelu.
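
For reference, once the functional.py stub is in place the op is called like any other functional (a trivial usage sketch):

import torch
import torch.nn.functional as F

x = torch.randn(8, 16)
y = F.gelu(x)  # element-wise GELU, output has the same shape as x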

Contributor Author

Done

@fmassa
Member

fmassa commented May 19, 2019

Could you also add a GPU implementation for it? We try to keep parity between CPU and CUDA as much as possible.

Also, what's the story around JIT fusion for CPU? It's disabled by default in PyTorch, but the fuser is currently able to handle all the operations here inside a single fusion group, and it seems to bring significant speedups on CPU (the same also applies to CUDA):

import torch

# should we enable it by default?
torch._C._jit_override_can_fuse_on_cpu(True)

def gelu(x):
    sqrt_two = 1.4142135623730951
    return x * 0.5 * (1.0 + torch.erf(x / sqrt_two))

@torch.jit.script
def gelu2(x):
    sqrt_two = 1.4142135623730951
    return x * 0.5 * (1.0 + torch.erf(x / sqrt_two))

x = torch.rand(64, 128, 56, 56)

# compile gelu2
gelu2(x)

and gives

%timeit gelu(x)
> 92.6 ms ± 647 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit gelu2(x)
> 18.5 ms ± 518 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Member

Don't we need to check that the tensors are contiguous before dispatching to the MKL-optimized codepath?

Contributor

This needs a comment; I would expect something like:

  1. Describe whether MKL supports non-contiguous inputs or not.
  2a. If it doesn't, when is it worth making the tensor contiguous to do the op? Does it ever pay off, or should I just check contiguity in the MKL pass?
  2b. If it does, why don't I just pass in the tensor?
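
A rough Python-level sketch of the decision being discussed, just to illustrate the options; the real code is C++ inside ATen, and gelu_reference here is only a stand-in for the two paths:

import math
import torch

def gelu_reference(x):
    # same exact GELU either way; only the memory-layout handling differs
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_dispatch(x):
    if x.is_contiguous():
        return gelu_reference(x)          # 2b: hand the tensor straight to the fast (MKL-style) path
    # 2a: either pay for a copy to use the fast path, or fall back to a strided loop
    return gelu_reference(x.contiguous())

print(gelu_dispatch(torch.rand(50, 50)[:, ::2]).shape)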

Contributor Author

I will add the CUDA impl in the next PR. Thanks for the advice.

Member

@fmassa fmassa May 19, 2019

Can you also test non-contiguous tensors? Like

x = torch.rand(50, 50)[:, ::2]
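
A minimal sketch of such a test (assumes torch.nn.functional.gelu is available; it compares the op on a strided view against the eager erf formula):

import math
import torch
import torch.nn.functional as F

x = torch.rand(50, 50)[:, ::2]   # non-contiguous view
assert not x.is_contiguous()

expected = 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))
assert torch.allclose(F.gelu(x), expected)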

Contributor

It looks like this always converts the input to be contiguous, although it probably doesn't need to since it's now using TensorIterator.

Contributor Author

I added a test for non-contiguous inputs.

Contributor

why?

Contributor

is this .contiguous() still needed after your latest changes?

Contributor Author

I think so, since our current approach uses a for-loop to take advantage of auto-vectorization on the non-MKL path. With or without MKL, the performance is actually quite similar.

@pytorchbot pytorchbot added the module: nn Related to torch.nn label May 27, 2019
@xiaomengy
Contributor Author

For the current version on the same machine, the MKL path needs 109ms while the non-MKL path needs 112ms, so they are currently quite similar.

Contributor

@soumith soumith left a comment

ship when tests pass.

Contributor

That's not conflicting: you can use TensorIterator, where the specialization for a contiguous block can call v?CdfNorm if MKL is available. It's decoupled from using vec256.h.

But just do that in a follow-up diff instead of this one, as it's an independent unit of work anyway.

Contributor

where \Phi(x) is the cumulative distribution function of the Gaussian distribution.
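
For reference, the suggested wording corresponds to the exact (erf-based) GELU:

\mathrm{GELU}(x) = x\,\Phi(x) = \frac{x}{2}\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right),
\qquad
\Phi(x) = \frac{1}{2}\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right).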

Contributor Author

Done.

Contributor

Contributor Author

Thanks! Done.

@pytorchbot pytorchbot added the module: docs Related to our documentation, both in docs/ and docblocks label Jun 2, 2019
Summary:
Pull Request resolved: pytorch#20665

Add gelu activation forward on CPU in pytorch

Compared to the current Python-implemented version of gelu in BERT models, like

  def gelu(self, x):
      return x * 0.5 * (1.0 + torch.erf(x / self.sqrt_two))

The torch.nn.functional.gelu function can reduce the forward time from 333ms to 109ms (with MKL) / 112ms (without MKL) for input size = [64, 128, 56, 56] on a devvm.

Reviewed By: zheng-xq

Differential Revision: D15400974

fbshipit-source-id: 78399123aef803376a2459d487d44557126070ac
@xiaomengy xiaomengy deleted the export-D15400974 branch June 2, 2019 16:23
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 2, 2019
@facebook-github-bot
Contributor

This pull request has been merged in 93ae040.

namespace {

template <typename T>
void GeluCUDAKernelImplInternal(const Tensor& X, Tensor* Y) {
Contributor

can you avoid passing Tensors as Tensor *? It's not standard C++ PyTorch code (is there any other example in the codebase that does this?). Depending on the use case, you can use const Tensor &, Tensor & or Tensor.

Contributor Author

I just wanted to make it clear that passing by pointer means it is an output variable or will be changed inside the function, purely for readability. I can change it to Tensor&.

https://google.github.io/styleguide/cppguide.html#Output_Parameters

Contributor

See: https://github.com/pytorch/pytorch/wiki/Writing-Python-in-cpp-(a-manifesto)

Also, if you look at any _out function (which translates to Python with an out= parameter), we use Tensor& already. Although note that this is kind of bogus, because reassigning the reference is almost never correct, but you should just follow the convention for now.

But the right way to think about this is you are already passing a (smart) pointer. Passing a pointer to a smart pointer is almost never what you want. And as noted in the link, const is essentially meaningless here, so trying to use static types for readability doesn't really work either (unless you implement ConstTensor).

@BramVanroy

Is this implemented in 1.2.0? I can find it in the documentation (https://pytorch.org/docs/stable/nn.functional.html), but I can't import it or find it in my installed library.

@cpuhrsch
Contributor

@BramVanroy - is this it?

@BramVanroy

@BramVanroy - is this it?

Odd. My IDE (PyCharm) underlines gelu in red and says "Cannot find reference 'gelu' in functional.pyi", but when I run the code it seems to import just fine.

from torch.nn.functional import gelu

@BramVanroy

BramVanroy commented Sep 28, 2019

PyTorch's current implementation is

def gelu(x):
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

rather than

def gelu(x):
    return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

Correct? I've seen both being mentioned, but I'm not sure which one is implemented in PyTorch. IIRC Google's BERT originally uses the former, and OpenAI's GPT the latter.

@xiaomengy
Contributor Author

The PyTorch implementation is the original definition of GELU, which is x * P(X <= x) where X ~ N(0, 1). This is mathematically equivalent to 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0))). The tanh form is an approximation of GELU which may lead to better performance, as mentioned in https://arxiv.org/pdf/1606.08415.pdf. However, in our tests the performance improvement depends on how the tanh function is implemented, so we didn't use it by default.
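
A quick numerical check of this explanation (a sketch; assumes a PyTorch build that includes torch.nn.functional.gelu):

import math
import torch
import torch.nn.functional as F

x = torch.randn(64, 128)

exact = 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))
approx = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

print(torch.allclose(F.gelu(x), exact))   # expected: True, F.gelu uses the exact erf form
print((F.gelu(x) - approx).abs().max())   # small but nonzero: tanh is only an approximation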

@BramVanroy

@BIT-silence Thank you for the reply. That clarifies things a lot. A final question: is there a reason that gelu doesn't have a Module equivalent that lives in nn (like nn.ReLU)?

@xiaomengy
Contributor Author

Actually, the main reason is that we didn't have enough time when adding it. We will consider adding the Module later.

@BramVanroy

Okay, thanks for the information. I wasn't sure whether there is in general a reason why some activation functions get a Module equivalent, but it seems it's mostly a time constraint.

@soumith
Contributor

soumith commented Sep 30, 2019

@BramVanroy generally we only keep functionals for layers which don't have learnable parameters. We used to add layers for all common functions, like nn.ReLU, but that's legacy.

@BramVanroy

@BramVanroy generally we only keep functionals for layers which don't have learnable parameters. We used to add layers for all common functions, like nn.ReLU, but that's legacy.

That makes sense. However, I do like that printing a Module gives you a nice overview of its submodules. If the activation is a Module, it is included (and it's clear to the user what it is). If it's a function, it's unfortunately not included, so you're left wondering whether (and where) an activation is taking place. Compare the two snippets below:

# as Module
from torch import nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.dense = nn.Linear(512, 1)
        self.activation = nn.ReLU()

    def forward(self, inputs):
        out = self.dense(inputs)
        out = self.activation(out)
        return out


net = Net()
print(net)

Prints

Net(
  (dense): Linear(in_features=512, out_features=1, bias=True)
  (activation): ReLU()
)

But when using functions, you don't see which activations are used nor where.

from torch import nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.dense = nn.Linear(512, 1)
        self.activation = F.relu

    def forward(self, inputs):
        out = self.dense(inputs)
        out = self.activation(out)
        return out


net = Net()
print(net)

Prints

Net(
  (dense): Linear(in_features=512, out_features=1, bias=True)
)

@xiaomengy
Contributor Author

The module, torch.nn.GELU, is implemented in #28944.
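
Picking up the printing example from above, a minimal sketch using the module form (assumes a PyTorch version that includes torch.nn.GELU, i.e. after #28944 landed):

from torch import nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.dense = nn.Linear(512, 1)
        self.activation = nn.GELU()   # shows up when printing the model

    def forward(self, inputs):
        return self.activation(self.dense(inputs))

print(Net())
# prints something like:
# Net(
#   (dense): Linear(in_features=512, out_features=1, bias=True)
#   (activation): GELU()
# )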

@BramVanroy

Great! It'll ship with 1.4 then, I presume?
