Fix nll_backward for negative weights #64572
Conversation
It also fixes an inconsistent treatment of the case `reduction = "mean"` when the whole target is equal to `ignore_index`. It now returns `NaN` in this case, consistently with what it returns when computing the mean over an empty tensor. We add tests for all these cases. [ghstack-poisoned]
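A minimal sketch of the `reduction = "mean"` corner case described above (shapes and values are illustrative):

```python
import torch
import torch.nn.functional as F

# Reference point: the mean over an empty tensor is NaN.
print(torch.empty(0).mean())  # tensor(nan)

# With this fix, a batch in which every target equals ignore_index behaves
# the same way under reduction="mean": there is nothing to average, so NaN.
log_probs = torch.randn(4, 3).log_softmax(dim=-1)
target = torch.full((4,), 2, dtype=torch.long)
loss = F.nll_loss(log_probs, target, ignore_index=2, reduction="mean")
print(loss)  # expected with this change: tensor(nan)
```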
💊 CI failures summary and remediations, as of commit 3594ae9 (more details on the Dr. CI page):
🕵️ 3 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
It also fixes an inconsistent treatment of the case `reduction = "mean"` when the whole target is equal to `ignore_index`. It now returns `NaN` in this case, consistently with what it returns when computing the mean over an empty tensor. We add tests for all these cases. ghstack-source-id: a88c848 Pull Request resolved: #64572
This has a UBSan problem that I'll fix with an explicit instantiation of a …
Thanks for the fix! :)
Since the behavior here wrt reduction='mean' is different now, it's possible that some edge cases in the wild might break.
Example: a batch full of targets that have weight=0. Before, this would return 0, but now it will return NaN. According to the math in the docs, NaN is correct, but that math is contested in #61309, and it's possible returning NaN instead of 0 would be surprising for similar reasons that the way we calculate the mean is surprising to some.
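A quick sketch of that edge case (the shapes, and the choice of class 1 as the zero-weight class, are just for illustration):

```python
import torch
import torch.nn.functional as F

# Every sample in the batch targets class 1, whose weight is 0.
log_probs = torch.randn(4, 3).log_softmax(dim=-1)
target = torch.ones(4, dtype=torch.long)
weight = torch.tensor([1.0, 0.0, 1.0])

loss = F.nll_loss(log_probs, target, weight=weight, reduction="mean")
# Before this change: tensor(0.); after: tensor(nan), since the weighted
# mean divides by a total weight of 0.
print(loss)
```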
Further, if we change the behavior, we'd need to make corresponding changes in XLA and deal with the UBSAN failure, as evidenced by the failing checks. These aren't hard to do; just requires some coordination with the XLA team and creating a NaN explicitly instead of doing 0/0.
However, rather than go through with all of this, I think I'd prefer to keep this PR to the minimal fix necessary to correctly handle negative weights for now. We can possibly address the BC-breaking parts in a future PR to keep things contained. Wdyt about this?
About the BC-breaking: I don't think this is BC-breaking so much as a bug fix. As you mentioned, this is the behaviour in the docs. What's more, at the moment the behaviour is inconsistent across sizes and devices, which is not desirable. I say we fix this.
Hi @jbschlosser, sounds good. @lezcano Feel free to make the change on the PyTorch side; the xla CI in this PR should fail, and you can open an issue on the pt/xla side. We will follow up.
Otherwise, just need the UBSAN fix and we're good to go :)
Fixes #64256 It also fixes an inconsistent treatment of the case `reduction = "mean"` when the whole target is equal to `ignore_index`. It now returns `NaN` in this case, consistently with what it returns when computing the mean over an empty tensor. We add tests for all these cases. [ghstack-poisoned]
It also fixes an inconsistent treatment of the case `reduction = "mean"` when the whole target is equal to `ignore_index`. It now returns `NaN` in this case, consistently with what it returns when computing the mean over an empty tensor. We add tests for all these cases. ghstack-source-id: 6b859cf Pull Request resolved: #64572
@jbschlosser I fixed the UBSAN failures. Now the only CI failures remaining are those from XLA (expected) and a couple that complain about the base of this branch. This should be ready.
@lezcano The pt/xla CI failure seems to be related to a merge conflict. Any chance you can rebase and surface the true error? If you could let me know which test you expect to fail, I might be able to repro the failure myself too.
Fixes #64256 It also fixes an inconsistent treatment of the case `reduction = "mean"` when the whole target is equal to `ignore_index`. It now returns `NaN` in this case, consistently with what it returns when computing the mean over an empty tensor. We add tests for all these cases. [ghstack-poisoned]
It also fixes an inconsistent treatment of the case `reduction = "mean"` when the whole target is equal to `ignore_index`. It now returns `NaN` in this case, consistently with what it returns when computing the mean over an empty tensor. We add tests for all these cases. ghstack-source-id: e20651a Pull Request resolved: #64572
@JackCaoG rebased.
@lezcano The pt/xla test passed; is the new behavior not being tested by the existing test?
@JackCaoG This PR adds new tests, and note that they are failing in an XLA run now.
jbschlosser left a comment
LGTM! Thanks :)
@lezcano Thanks, I will take care of it.
@jbschlosser has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@jbschlosser I will update here when the xla change is ready.
Hm, I'm now running into internal tests that explicitly depend on the old behavior of returning 0.0 for the "all ignored" case, which makes me hesitant again about changing this behavior. This is for an internal loss function based on NLLLoss that explicitly claims to support missing labels; returning NaNs here doesn't seem nice for that case. If PyTorch had a nice way to indicate "no gradients" when computing the mean of an empty set, we wouldn't have to choose between NaN (breaks training) and 0.0 (allows training to continue, but is not mathematically correct and may have undesirable side effects). cc @cpuhrsch

FYI, I do agree it's undesirable to have inconsistency across sizes / devices; do you mind pointing out the current discrepancies, and maybe we could at least fix those?
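For callers in that situation, a possible user-side guard is sketched below; `nll_loss_skip_all_ignored` is a hypothetical wrapper for illustration, not something this PR adds:

```python
import torch
import torch.nn.functional as F

def nll_loss_skip_all_ignored(log_probs, target, ignore_index=-100, **kwargs):
    # Hypothetical wrapper: when every target in the batch is ignored,
    # return a zero loss that still participates in autograd (and thus
    # yields zero gradients) instead of propagating NaN into training.
    if (target == ignore_index).all():
        return log_probs.sum() * 0.0
    return F.nll_loss(log_probs, target, ignore_index=ignore_index, **kwargs)
```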
Note that it supports missing labels, but if you give it something with no labels at all, then the reduction should return `NaN`. Also, having "no gradients" would not solve this, as the NaNs appear in the forward. The backward may return whatever, as the gradient at these points is not even well-defined.
It also fixes an inconsistent treatment of the case `reduction = "mean"` when the whole target is equal to `ignore_index`. It now returns `NaN` in this case, consistently with what it returns when computing the mean over an empty tensor. We add tests for all these cases. ghstack-source-id: f154576 Pull Request resolved: #64572
@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
The xla test failure seems real. Do we need a sister PR on the xla repo?
@albanD I will work on a fix. I already have a prototype; I just need to fix some additional failures on the xla end.
I'm not very familiar with the joint landing process.
@albanD Please give me 1-2 days; I will update here when the xla PR is ready. After that, I will merge the xla PR once the PyTorch PR is merged.
@albanD @lezcano For nll_loss, should the grad remain 0 if the forward pass returns `NaN`? After my change, CPU and xla return different values here. I can update the xla behavior but want to make sure this is intended.
I would say that a grad of zero is fine, as it is the fastest one to return. The function is not differentiable at these points (it's not even well-defined), so we are free to return anything we want. As such, we can choose to return whatever is fastest to compute. In this case, we can return zeros if all the elements in target were ignored, and inf or nan if the sum of the weights was zero (inf if the quantity divided by the total weight is non-zero, nan if it is zero as well).
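For reference, a sketch of the weighted-mean reduction being discussed; this is a simplified re-implementation for illustration, not the actual CPU/CUDA kernel:

```python
import torch

def nll_mean(log_probs, target, weight, ignore_index=-100):
    # Simplified model of nll_loss(..., reduction="mean") with class weights:
    #   loss = sum_i w[target_i] * (-log_probs[i, target_i]) / sum_i w[target_i]
    # where ignored elements contribute to neither sum.
    idx = (target != ignore_index).nonzero(as_tuple=True)[0]
    w = weight[target[idx]]
    losses = -log_probs[idx, target[idx]]
    # All elements ignored -> 0/0 -> nan; a non-zero numerator over a zero
    # total weight -> +/-inf, matching the cases described above.
    return (w * losses).sum() / w.sum()
```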
Fixes #64256 It also fixes an inconsistent treatment of the case `reduction = "mean"` when the whole target is equal to `ignore_index`. It now returns `NaN` in this case, consistently with what it returns when computing the mean over an empty tensor. We add tests for all these cases. Differential Revision: [D31116297](https://our.internmc.facebook.com/intern/diff/D31116297) cc ezyang gchanan [ghstack-poisoned]
It also fixes an inconsistent treatment of the case `reduction = "mean"` when the whole target is equal to `ignore_index`. It now returns `NaN` in this case, consistently with what it returns when computing the mean over an empty tensor. We add tests for all these cases. ghstack-source-id: dcb4907 Pull Request resolved: #64572
For what it's worth, I have updated the code for the derivatives to make it more homogeneous. It should now be obvious that all the paths in both CPU and CUDA return what was described in #64572 (comment).
Thanks!
@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@JackCaoG OK, all the internal tests are good, so this is ready on my end. Let me know when it's good for you!
@albanD The xla PR is ready at pytorch/xla#3144 (review). I will merge it after this PR is merged.
Thanks!
PyTorch NLL behavior has changed: now, where the sum of the weight values is zero, the output will be NaN. Our LTC tests must be updated accordingly.
- pytorch/pytorch changed NLL behavior (#64572)
- pytorch/xla introduced a corresponding change to test utils (pytorch/xla#3144)
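A hypothetical flavour of the kind of test-expectation update this implies (illustrative only, not the actual LTC test code):

```python
import torch
import torch.nn.functional as F

log_probs = torch.randn(3, 5).log_softmax(dim=-1)
target = torch.tensor([0, 1, 2])
weight = torch.zeros(5)  # total weight for the batch is zero

loss = F.nll_loss(log_probs, target, weight=weight, reduction="mean")
assert torch.isnan(loss)  # the old expectation here would have been 0.0
```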
This PR allows results from `nn_loss` to be `nan`, which is the same behavior as with CUDA and CPU (#64572 (comment)). Fixes #134431. Ref #64572, #119108. Pull Request resolved: #135434. Approved by: https://github.com/malfet
Stack from ghstack:
Fixes #64256
It also fixes an inconsistent treatment of the case `reduction = "mean"` when the whole target is equal to `ignore_index`. It now returns `NaN` in this case, consistently with what it returns when computing the mean over an empty tensor.
We add tests for all these cases.
Differential Revision: D31116297
cc @ezyang @gchanan