
Add the pytorch implementation of the OpenAI GeLU approximation #21344

@jlamypoirier


Feature request

Add support for the PyTorch implementation of OpenAI's tanh approximation of the GeLU function, added in PyTorch 1.12. This implementation is equivalent to gelu_new and gelu_fast but much faster. It could be exposed as a separate activation function, for example gelu_new_python, to avoid disrupting existing models.
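
A minimal sketch of what such an activation could look like, wrapping the fused PyTorch kernel so it can be registered in transformers.activations.ACT2FN (the class name NewGELUPytorch and the key gelu_new_python are placeholders; the final naming is up to the maintainers):

import torch
from torch import nn


class NewGELUPytorch(nn.Module):
    """
    Tanh approximation of GeLU backed by the fused PyTorch kernel
    (requires torch >= 1.12). Numerically matches NewGELUActivation /
    FastGELUActivation up to rounding errors.
    """

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        return nn.functional.gelu(input, approximate="tanh")


# Hypothetical registration under a new key, leaving existing keys untouched:
# ACT2FN["gelu_new_python"] = NewGELUPytorch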

Motivation

Many transformer models use OpenAI's tanh approximation of the GeLU through the activation functions gelu_new or gelu_fast. These implementations are extremely slow (despite their names) because they consist of multiple separate operations/kernels (8 and 9, respectively).
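
For reference, the formula these activations compute is the one below (this matches NewGELUActivation in transformers.activations); in eager mode every arithmetic operation launches its own kernel, which is where the slowness comes from:

import math
import torch


def gelu_new_reference(x: torch.Tensor) -> torch.Tensor:
    # OpenAI/GPT-2 tanh approximation of GeLU. Each multiply, add, pow and
    # tanh below is a separate kernel launch in eager PyTorch.
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))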

Since version 1.12, PyTorch supports a single-kernel C/CUDA implementation through the argument approximate='tanh' (https://pytorch.org/docs/stable/generated/torch.nn.GELU.html). This implementation is 6-10x faster than what currently exists in transformers, and is numerically equal up to rounding errors.
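
For illustration, the fused version is available both as a functional call and as a module, so it could serve as a drop-in replacement for the existing activation modules (the tensor shape below is arbitrary, just for the example):

import torch

x = torch.randn(8, 1024)

# Functional form (torch >= 1.12)
y_functional = torch.nn.functional.gelu(x, approximate="tanh")

# Module form, interchangeable with the existing activation modules
act = torch.nn.GELU(approximate="tanh")
y_module = act(x)

assert torch.allclose(y_functional, y_module)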

When benchmarking the inference speed of the SantaCoder models, I found that using the PyTorch implementation yielded an end-to-end speedup of ~15-20%.

I also benchmarked the speed and accuracy using the following code (on an A100-80GB):


import time

import torch

from transformers.activations import NewGELUActivation, FastGELUActivation

dtype = torch.float32
eps = torch.finfo(dtype).eps

# 2**30 elements (~4 GiB in float32) so that the kernels dominate the timing.
x = torch.empty([2**30], device="cuda", dtype=dtype).normal_()
torch.cuda.synchronize()
t0 = time.perf_counter()

# Fused PyTorch kernel for the tanh approximation (PyTorch >= 1.12).
y0 = torch.nn.functional.gelu(x, approximate="tanh")
torch.cuda.synchronize()
t1 = time.perf_counter()

# Existing multi-kernel implementations in transformers.
y1 = NewGELUActivation()(x)
torch.cuda.synchronize()
t2 = time.perf_counter()

y2 = FastGELUActivation()(x)
torch.cuda.synchronize()
t3 = time.perf_counter()

# Exact (erf-based) GeLU for comparison.
y3 = torch.nn.functional.gelu(x)
torch.cuda.synchronize()
t4 = time.perf_counter()

print(f"Torch tanh: {1000 * (t1 - t0):.3f} ms")
print(f"New: {1000 * (t2 - t1):.3f} ms")
print(f"Fast: {1000 * (t3 - t2):.3f} ms")
print(f"Torch orig: {1000 * (t4 - t3):.3f} ms")

# Standard deviation of the elementwise difference, in units of machine epsilon.
print(f"Torch tanh vs new: {(y1 - y0).float().std().cpu().item() / eps:.3f}")
print(f"Torch tanh vs fast: {(y2 - y0).float().std().cpu().item() / eps:.3f}")
print(f"New vs fast: {(y2 - y1).float().std().cpu().item() / eps:.3f}")
print(f"Torch tanh vs torch orig: {(y3 - y0).float().std().cpu().item() / eps:.3f}")

With output:

Torch tanh: 4.921 ms
New: 43.253 ms
Fast: 50.269 ms
Torch orig: 4.989 ms
Torch tanh vs new: 0.042
Torch tanh vs fast: 0.147
New vs fast: 0.147
Torch tanh vs torch orig: 971.960

I.e., the PyTorch tanh version matches the new and fast GeLU implementations to within a fraction of machine epsilon while being 8.8x/10.2x faster, but, as expected for an approximation, differs from the exact (erf-based) version.

With dtype=torch.float16:

Torch tanh: 3.342 ms
New: 22.667 ms
Fast: 26.104 ms
Torch orig: 3.395 ms
Torch tanh vs new: 0.244
Torch tanh vs fast: 0.243
New vs fast: 0.143
Torch tanh vs torch orig: 0.216

I.e., it's 6.8x/7.8x faster, and the choice of implementation doesn't matter for accuracy because float16 rounding errors dominate.

On CPU (float32), with size 2**28 (268M elements):

Torch tanh: 182.575 ms
New: 1683.934 ms
Fast: 1925.547 ms
Torch orig: 141.410 ms
Torch tanh vs new: 0.043
Torch tanh vs fast: 0.144
New vs fast: 0.144
Torch tanh vs torch orig: 971.852

I.e., same accuracy and a similar speedup (9.2x/10.5x faster).

Your contribution

Opened a draft PR (#21345)
