zero out possible NaN in pytorch.ctc_loss #21244
Conversation
I'm skeptical about this without some repro. My intuition is that if it produces invalid NaNs occasionally, it will likely produce wrong non-NaN results. Could you please capture the inputs to CTC loss that produce NaNs (i.e. keep them around and, when you get NaN gradients, save them)?
Oh, and thanks for investigating this. It's awesome to see that you tracked this down!
@t-vi yes, creating a consistent repro is the most important thing here. I was able to reproduce NaN occasionally by injecting gradient-check code during training and dropping into pdb once there was any NaN in the gradient. But unfortunately, supplying the exact same input to CTCLoss again did not produce NaN, so I am not able to give you a consistent repro yet. I am trying other approaches to get a repro. Btw, we use 1-d targets, so #20971 won't help. Also, we notice that once we disable large-batch handling, the NaN also disappears. Since a lot of people's work is pending on a numerically stable CTC loss, may I suggest we merge this PR as a stopgap solution, similar in purpose to zero_infinity?
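For context, a minimal sketch of the kind of instrumentation described above (a hypothetical helper with assumed names and shapes, not our actual training code): compute the CTC gradient each step and dump the exact inputs whenever it contains NaN, so the case can hopefully be replayed later.

```python
import torch
import torch.nn.functional as F

def ctc_step_with_nan_capture(log_probs, targets, input_lengths, target_lengths,
                              dump_path="ctc_nan_inputs.pt"):
    # log_probs: (T, N, C), already log_softmax-ed and requiring grad.
    loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                      blank=0, reduction="sum", zero_infinity=False)
    grad, = torch.autograd.grad(loss, log_probs, retain_graph=True)
    if torch.isnan(grad).any():
        # Save the exact inputs that produced a NaN gradient for later replay.
        torch.save({"log_probs": log_probs.detach().cpu(),
                    "targets": targets.detach().cpu(),
                    "input_lengths": input_lengths,
                    "target_lengths": target_lengths},
                   dump_path)
    return loss
```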
OK, so it's something different. I would indeed want to get us to the situation of ...
The other thing I'm wondering about: do you get the NaN in the forward or the backward? Because you also delete them in the forward, even though you describe observing them in the gradient only.
t-vi left a comment:
Along the lines of the comments above, I think we need some form of repro / example that causes this, and ideally a test.
I have a strong preference for a more proper fix, i.e. finding and fixing the apparent algorithmic defect.
```cpp
if (zero_infinity) {
  grad = at::where(neg_log_likelihood.view({1, batch_size, 1}) == Scalar(INFINITY), at::zeros({}, grad.options()), grad);
  grad = at::where(grad != grad, at::zeros({}, grad.options()), grad);  // to filter out NaN value
```
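For readers less familiar with ATen, here is roughly the same masking expressed at the Python level (a sketch with made-up shapes, not code from this PR). The grad != grad comparison is true exactly at NaN entries, since NaN compares unequal to itself.

```python
import torch

# Illustrative shapes only: grad is (T, N, C) w.r.t. log_probs,
# neg_log_likelihood is the per-sample loss of shape (N,).
T, N, C = 50, 16, 42
grad = torch.randn(T, N, C)
neg_log_likelihood = torch.randn(N)

# Zero a sample's whole gradient where its loss is infinite (zero_infinity) ...
grad = torch.where(neg_log_likelihood.view(1, N, 1) == float("inf"),
                   torch.zeros((), dtype=grad.dtype), grad)
# ... and then zero any remaining NaN entries.
grad = torch.where(grad != grad, torch.zeros((), dtype=grad.dtype), grad)
```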
Now, I would like to see where this is actually happening. You describe (and I can believe) that it's not necessarily reproducible, but as mentioned above, I think we need to understand what the causes are, and I would half expect the result to be wrong in the cases where otherwise NaN-producing inputs produce non-NaN results.
Can you capture a sample of inputs for which the gradient turns NaN, please?
@t-vi, yes, this is the main fix I want to perform -- the GPU kernels in the backward; all the CPU code changes are to make the zero_infinity option behave consistently. As I said, I can occasionally capture NaN, but it is not reproducible consistently (the input I captured results in NaN, but the next time I supply it to CTCLoss, the NaN is gone), and I am working on producing a true example (trying different approaches now ...).
I fully agree with you that we need to find the root cause to fix it permanently (and I am eager to find it!). However, since this NaN issue is blocking a lot of other people's work (including mine, my colleagues', and that of many others on the PyTorch forum), may I suggest that we patch this for the time being? I will definitely work with you to find the real solution afterwards.
Could you check whether the gradient is correct when it isn't NaN?
Sorry, but without an example, I'm not enthused about zeroing NaNs. Note that most of the people on the forums ended up having errors in their inputs (non-log-probability inputs, invalid targets, ...). After seeing so many of those, I'll have to see an example where the code goes wrong before adding band-aids.
The other concern is that if it produces bad NaN gradients, I would not believe the non-NaN gradients either, and these would need a different fix, too.
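As an aside, one way to answer the "is the non-NaN gradient correct?" question is a numerical gradient check on a small double-precision case. A sketch under assumed (tiny) sizes:

```python
import torch
import torch.nn.functional as F
from torch.autograd import gradcheck

# Tiny double-precision case; the sizes are illustrative only.
T, N, C, S = 8, 2, 5, 3  # time steps, batch, classes (incl. blank), target length
log_probs = torch.randn(T, N, C, dtype=torch.double).log_softmax(2).requires_grad_()
targets = torch.randint(1, C, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

# Compares the analytic gradient of ctc_loss against finite differences.
ok = gradcheck(
    lambda lp: F.ctc_loss(lp, targets, input_lengths, target_lengths,
                          blank=0, reduction="sum"),
    (log_probs,),
)
print("gradcheck passed:", ok)
```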
Sorry that I missed this question. I observe that NaN appears in the gradient first (no NaN in the input), and the loss is a normal number in that case. After the gradient has NaN, it propagates to the model parameters, then to the input (to CTCLoss) in the next mini-batch, and then everywhere.
@t-vi, @yqwangustc and I discussed this offline and decided that we should land this patch to stem the bleeding (but we should continue to investigate what the real problem is).
I have one more request, which is that ...
Related: #14335 (no code was provided in the reports, but it looks similar)
A bit more information on the input causing NaN: the loss is a normal floating-point number, so the forward pass is fine. grad[983, 1:10, 41] contains all of the NaN values, i.e., all the NaNs appear at the BLANK posterior, but those entries are just padded positions in the time sequence. Switching to the small-batch condition does not yield any NaN, so I am checking grad before/after this line: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/LossCTC.cu#L560. @t-vi, will this information help you identify something?
update: by inserting some checking code, I confirm that https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/LossCTC.cu#L560 is indeed the offending line.
update again: additional data examination reveals that ...
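A small sketch (hypothetical helper, assumed tensor names) of the check described above: verify that any NaNs in the gradient sit only at time steps beyond each sample's input_length, i.e. in the padding.

```python
import torch

def nan_only_in_padding(grad, input_lengths):
    # grad: (T, N, C) gradient w.r.t. log_probs.
    # input_lengths: (N,) tensor of valid lengths, on the same device as grad.
    T = grad.size(0)
    nan_mask = torch.isnan(grad).any(dim=2)                      # (T, N)
    time_idx = torch.arange(T, device=grad.device).unsqueeze(1)  # (T, 1)
    padding_mask = time_idx >= input_lengths.unsqueeze(0)        # (T, N)
    # True iff every NaN location is a padded time step.
    return bool((~nan_mask | padding_mask).all())
```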
Thank you for the analysis! This is extremely useful. Thanks again for tracking this down! So I'm happy with your fix; maybe we can add a comment that this is in the padding, and that when the blank calculation is moved to a kernel, properly ignoring those bits will be better.
Actually, after many iterations, I think I have found the real root cause now! I started to notice that some values in log_beta are never properly initialized.
These are consistent with the pattern of either NaN or some wildly large values (e.g., -1.03845089e+34) that I saw. So the suggested change is to initialize all of log_beta to -inf.
@t-vi, please kindly correct me if I am wrong. I am running more experiments, but I am quite confident this will permanently fix the problem :-) If this looks good to you, I can abandon this PR and create a new one (which only has a one-line change!).
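As a toy illustration of the log-space pitfall involved here (my own sketch, not the kernel code): entries that should contribute nothing must be initialized to -inf, and the log-sum-exp has to guard the all-(-inf) case explicitly, otherwise (-inf) - (-inf) turns into NaN.

```python
import math

NEG_INF = float("-inf")

def log_sum_exp_naive(a, b):
    m = max(a, b)
    # If both inputs are -inf, (a - m) is (-inf) - (-inf) = NaN,
    # so the whole result silently becomes NaN instead of -inf.
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_sum_exp_guarded(a, b):
    m = max(a, b)
    if m == NEG_INF:  # guard the all-(-inf) case explicitly
        return NEG_INF
    return m + math.log(math.exp(a - m) + math.exp(b - m))

print(log_sum_exp_naive(NEG_INF, NEG_INF))    # nan
print(log_sum_exp_guarded(NEG_INF, NEG_INF))  # -inf
```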
Heh, shaking that function really has the bugs falling out... Awesome work!
But I think we might still have to zero "out of sequence" gradients due to the use of log_probs.
Do you want help to update the PR? I'd be glad to help code something up, but you did all the hard work already...
@t-vi, the new PR is on the way; I will send it out soon. Sorry, I just woke up :-)
Supercool! Thanks!
Abandoning this PR in favor of pytorch/pytorch#21392.
Summary: Pull Request resolved: #21392. As discussed at #21244, we found that some values in log_beta are not properly initialized. This diff will 1) initialize all of log_beta to -inf; 2) fix a tricky compare condition; 3) zero out all the gradient elements corresponding to padding. Offline experiments show that this diff fixes the previously seen NaN loss. Differential Revision: D15637977 fbshipit-source-id: 477008a5e11aae946bd2aa401ab7e0c513421af0
When using the ATen version of CTCLoss, we observe nondeterministically getting NaN in the output gradient. Initial examination reveals that 1) those NaNs happen rarely and only at the BLANK position; 2) if we disable large-batch handling (https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/LossCTC.cu#L552), the NaNs disappear.
Based on the above observations, I believe this line (https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/LossCTC.cu#L560) might produce NaN nondeterministically.
Unfortunately, due to the nondeterministic behavior, I am not able to provide a repro and thus pinpoint the exact root cause. Until we find it, we'd like to place a stopgap here to zero out any NaN in the gradient.
Note that there is a previous PR which zeroes out infinity only (#16199); unfortunately, NaN is not infinity, so it won't help if there is a NaN in the gradient. This PR overloads the zero_infinity option to zero out both infinity and NaN. Considering they are used for the same purpose, I think it is fine to just reuse the zero_infinity option for the moment.
Differential Revision: D15591296
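For reference, a minimal usage sketch of the zero_infinity option discussed above (sizes and values are purely illustrative): one sample is deliberately made infeasible (target longer than its input) so its loss is infinite, and zero_infinity=True keeps that from poisoning the batch gradient.

```python
import torch
import torch.nn as nn

T, N, C, S = 30, 4, 20, 10  # time steps, batch, classes (incl. blank), target length
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

log_probs = torch.randn(T, N, C).log_softmax(2).requires_grad_()
targets = torch.randint(1, C, (N, S), dtype=torch.long)
# The last sample's input is far too short for its target, so its loss is inf.
input_lengths = torch.tensor([T, T, T, 2], dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
# With zero_infinity=True the infinite sample contributes zero gradient,
# so the remaining gradient stays finite.
print(loss.item(), torch.isfinite(log_probs.grad).all().item())
```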