Conversation

@t-vi (Collaborator) commented Jan 20, 2019

Here is a stab at implementing an option to zero out infinite losses (and NaN gradients).
It might be nicer to move the zeroing to the respective kernels.
The default is currently `False` to mimic the old behaviour, but I'd be half inclined to set the default to `True`, because the behaviour wasn't consistent between CuDNN and the native implementation anyway, and the NaN gradients aren't terribly useful.

This topic seems to come up regularly, e.g. in #14335
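
For reference, a minimal sketch of how the option is meant to be used, written against the `zero_infinity` keyword as it appears in `nn.CTCLoss` today (the exact signature at the time of this PR may differ; the shapes and values below are made up for illustration):

```python
import torch
import torch.nn as nn

# A target longer than the input has no valid alignment, so the CTC loss is inf.
T, N, C = 10, 2, 5                                        # input length, batch, classes (incl. blank)
raw = torch.randn(T, N, C, requires_grad=True)
log_probs = raw.log_softmax(2)
targets = torch.randint(1, C, (N, 12), dtype=torch.long)  # target length 12 > T = 10
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(reduction='none', zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss)                                  # zeros instead of inf
loss.sum().backward()
print(torch.isnan(raw.grad).any())           # gradients are zeroed rather than NaN
```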

@t-vi (Collaborator, Author) commented Jan 20, 2019

@jinserk , @SeanNaren in case it's of interest to you.

@jinserk commented Jan 20, 2019 via email

@t-vi (Collaborator, Author) commented Jan 22, 2019

I managed to reproduce the CI failures, so I'm hopeful I'll find the root cause soonish.

@ryanleary

@soumith can this make 1.0.1?

@t-vi (Collaborator, Author) commented Jan 23, 2019 via email

@ryanleary

Ah, I misread the order of the comment and your last commit -- I assumed the remaining failures were spurious.

@t-vi (Collaborator, Author) commented Feb 5, 2019

The errors don't look related to the patch to me, so I would think it is good to go.

@soumith (Contributor) commented Feb 7, 2019

I'd love to merge it, but it says there are merge conflicts.

@t-vi (Collaborator, Author) commented Feb 7, 2019

The big IntList rename hit it. I'll rebase.

@t-vi (Collaborator, Author) commented Feb 7, 2019

I think it should be good now.

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Feb 11, 2019
Summary:
Here is a stab at implementing an option to zero out infinite losses (and NaN gradients).
It might be nicer to move the zeroing to the respective kernels.
The default is currently `False` to mimic the old behaviour, but I'd be half inclined to set the default to `True`, because the behaviour wasn't consistent between CuDNN and Native anyways and the NaN gradients aren't terribly useful.

This topic seems to come up regularly, e.g. in  #14335
Pull Request resolved: pytorch/pytorch#16199

Differential Revision: D14020462

Pulled By: ezyang

fbshipit-source-id: 5ba8936c66ec6e61530aaf01175dc49f389ae428
pearu pushed a commit to Quansight/pytorch that referenced this pull request Feb 12, 2019
@chrisemezue commented Dec 22, 2021

@t-vi, a naive question please: if I am running on CPU (I mistakenly forgot to change the Colab runtime to GPU) with torch installed for CUDA (version 1.10.0+cu111) and I use `zero_infinity=True` on CTCLoss, will the CPU/GPU mismatch make it not work?

I was getting NaN losses even with `zero_infinity=True`. However, when I changed the runtime to GPU, there were no more NaN losses (I am guessing `zero_infinity=True` actually kicked in).

I just want to be sure that the GPU/CPU difference (installing torch+cu but using the CPU runtime) was definitely the issue.
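
As a quick sanity check, here is a small diagnostic sketch (the helper `inspect_ctc` and the shapes are made up for illustration) that compares the unreduced losses on CPU and, if available, CUDA. Note that `zero_infinity` only zeroes infinite losses and their gradients, so a NaN that survives it may originate elsewhere (e.g. NaNs already present in `log_probs`):

```python
import torch
import torch.nn.functional as F

def inspect_ctc(log_probs, targets, input_lengths, target_lengths, device):
    # Run the functional form with per-sample losses so inf/NaN entries are visible.
    loss = F.ctc_loss(log_probs.to(device), targets.to(device),
                      input_lengths.to(device), target_lengths.to(device),
                      reduction='none', zero_infinity=True)
    print(device,
          'inf:', torch.isinf(loss).any().item(),
          'nan:', torch.isnan(loss).any().item())

# Illustrative inputs where the target is longer than the input (the inf case).
T, N, C = 10, 2, 5
log_probs = torch.randn(T, N, C).log_softmax(2)
targets = torch.randint(1, C, (N, 12), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

inspect_ctc(log_probs, targets, input_lengths, target_lengths, 'cpu')
if torch.cuda.is_available():
    inspect_ctc(log_probs, targets, input_lengths, target_lengths, 'cuda')
```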
