Training with fp16 precision gives nan in LongT5 #17978

@Leonard907

System Info

  • transformers version: 4.10.0.dev0
  • Platform: Linux-3.10.0-1160.62.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.8.13
  • PyTorch version (GPU?): 1.9.0+cu111 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

@patrickvonplaten

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm currently running the scrolls_benchmark. I'm interested in how the LongT5 model performs on SCROLLS, so I changed the model name to google/long-t5-tglobal-base and ran training with fp16 enabled (with fp32 I get CUDA OOM errors). However, the output loss is always nan. I searched for fixes and found this post: t5-fp16-fixed. Looking through the transformers repo, the modeling_longt5 file does not seem to include the clamp_value change. Could this be why fp16 is not working for LongT5? And if so, could it be fixed with the same approach you used for T5? Thank you very much!
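For reference, this is roughly the clamping that modeling_t5.py applies after each sub-layer so fp16 activations cannot overflow to inf (and later turn the loss into nan). The wrapper function below is only my own sketch of that logic, not actual transformers code:

```python
import torch

def clamp_fp16_hidden_states(hidden_states: torch.Tensor) -> torch.Tensor:
    """Sketch of the clamp_value fix from modeling_t5.py: keep fp16
    activations strictly below the fp16 max so they never become inf."""
    if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any():
        clamp_value = torch.finfo(hidden_states.dtype).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
    return hidden_states

# Example: overflowed fp16 values are pulled back into range instead of
# propagating inf/nan through the rest of the network.
x = torch.tensor([1.0, float("inf"), -float("inf")], dtype=torch.float16)
print(clamp_fp16_hidden_states(x))
```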

FYI: you probably noticed that the transformers version is 4.10.0, which does not ship LongT5. I manually added the LongT5 files to a forked scrolls repo here: longt5_folder. It does work properly with a small parameter setting.
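For anyone trying to reproduce this outside the scrolls training script, here is a minimal standalone sketch (not the actual scrolls run script) that loads the model in fp16 and checks whether a single forward pass already produces nan. It assumes a recent transformers release that ships LongT5; the inputs are placeholders:

```python
import torch
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

# Standalone sketch, not the scrolls run script: cast the model to fp16 and
# check whether one forward pass already yields nan in the loss/logits.
tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
model = LongT5ForConditionalGeneration.from_pretrained(
    "google/long-t5-tglobal-base", torch_dtype=torch.float16
).to("cuda")

# Placeholder long input; the real runs use SCROLLS documents.
inputs = tokenizer("summarize: " + "some long document text " * 512, return_tensors="pt").to("cuda")
labels = tokenizer("a short summary", return_tensors="pt").input_ids.to("cuda")

with torch.no_grad():
    out = model(**inputs, labels=labels)

print("loss:", out.loss.item())
print("nan in logits:", torch.isnan(out.logits).any().item())
```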

Expected behavior

The LongT5 model should not produce a nan loss when trained with fp16.
