Conversation

@t-vi (Collaborator) commented Jan 20, 2019

Here is a stab at implementing an option to zero out infinite losses (and NaN gradients).
It might be nicer to move the zeroing to the respective kernels.
The default is currently `False` to mimic the old behaviour, but I'd be half inclined to set the default to `True`, because the behaviour wasn't consistent between CuDNN and the native implementation anyway, and the NaN gradients aren't terribly useful.

This topic seems to come up regularly, e.g. in #14335
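
For reference, a minimal sketch of how the option is meant to be used, written against the `zero_infinity` keyword as it appears in `nn.CTCLoss` today (the exact signature at the time of this PR may differ; the shapes and values below are made up for illustration):

```python
import torch
import torch.nn as nn

# A target longer than the input has no valid alignment, so the CTC loss is inf.
T, N, C = 10, 2, 5                                        # input length, batch, classes (incl. blank)
raw = torch.randn(T, N, C, requires_grad=True)
log_probs = raw.log_softmax(2)
targets = torch.randint(1, C, (N, 12), dtype=torch.long)  # target length 12 > T = 10
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(reduction='none', zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss)                                  # zeros instead of inf
loss.sum().backward()
print(torch.isnan(raw.grad).any())           # gradients are zeroed rather than NaN
```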

@t-vi (Collaborator, Author) commented Jan 20, 2019

@jinserk , @SeanNaren in case it's of interest to you.

@jinserk commented Jan 20, 2019 via email

@t-vi (Collaborator, Author) commented Jan 22, 2019

I managed to reproduce the CI failures, so I'm hopeful I'll find the root cause soonish.

@ryanleary

@soumith can this make 1.0.1?

@t-vi (Collaborator, Author) commented Jan 23, 2019 via email

@ryanleary

Ah, I misread the order of the comment and your last commit -- I assumed the remaining failures were spurious.

@t-vi (Collaborator, Author) commented Feb 5, 2019

The errors don't look related to the patch to me, so I would think it is good to go.

@soumith (Contributor) commented Feb 7, 2019

I'd love to merge it, but it says there are merge conflicts.

@t-vi (Collaborator, Author) commented Feb 7, 2019

The big IntList rename hit it. I'll rebase.

@t-vi (Collaborator, Author) commented Feb 7, 2019

I think it should be good now.

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Feb 11, 2019
Summary:
Here is a stab at implementing an option to zero out infinite losses (and NaN gradients).
It might be nicer to move the zeroing to the respective kernels.
The default is currently `False` to mimic the old behaviour, but I'd be half inclined to set the default to `True`, because the behaviour wasn't consistent between CuDNN and Native anyways and the NaN gradients aren't terribly useful.

This topic seems to come up regularly, e.g. in  #14335
Pull Request resolved: pytorch/pytorch#16199

Differential Revision: D14020462

Pulled By: ezyang

fbshipit-source-id: 5ba8936c66ec6e61530aaf01175dc49f389ae428
pearu pushed a commit to Quansight/pytorch that referenced this pull request Feb 12, 2019
@chrisemezue commented Dec 22, 2021

@t-vi, a naive question please: if I am running on CPU (I mistakenly forgot to change the Colab runtime to GPU) with torch installed for CUDA (version 1.10.0+cu111) and I use `zero_infinity=True` on CTCLoss, will the CPU/GPU mismatch make it not work?

I was getting NaN losses even with `zero_infinity=True`. However, when I changed the runtime to GPU, there were no more NaN losses (I am guessing `zero_infinity=True` actually kicked in).

I just want to be sure that the GPU/CPU difference (installing torch+cu but using the CPU runtime) was definitely the issue.
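
As a quick sanity check, here is a small diagnostic sketch (the helper `inspect_ctc` and the shapes are made up for illustration) that compares the unreduced losses on CPU and, if available, CUDA. Note that `zero_infinity` only zeroes infinite losses and their gradients, so a NaN that survives it may originate elsewhere (e.g. NaNs already present in `log_probs`):

```python
import torch
import torch.nn.functional as F

def inspect_ctc(log_probs, targets, input_lengths, target_lengths, device):
    # Run the functional form with per-sample losses so inf/NaN entries are visible.
    loss = F.ctc_loss(log_probs.to(device), targets.to(device),
                      input_lengths.to(device), target_lengths.to(device),
                      reduction='none', zero_infinity=True)
    print(device,
          'inf:', torch.isinf(loss).any().item(),
          'nan:', torch.isnan(loss).any().item())

# Illustrative inputs where the target is longer than the input (the inf case).
T, N, C = 10, 2, 5
log_probs = torch.randn(T, N, C).log_softmax(2)
targets = torch.randint(1, C, (N, 12), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

inspect_ctc(log_probs, targets, input_lengths, target_lengths, 'cpu')
if torch.cuda.is_available():
    inspect_ctc(log_probs, targets, input_lengths, target_lengths, 'cuda')
```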
