Conversation

@pmeier (Collaborator) commented Aug 24, 2021

@pmeier added labels: module: nn (Related to torch.nn), module: tests (Issues related to tests (not the torch.testing module)) Aug 24, 2021
@pmeier requested a review from zou3519 on August 24, 2021 09:53
@facebook-github-bot (Contributor) commented Aug 24, 2021

💊 CI failures summary and remediations

As of commit 93bd050 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


@pmeier (Collaborator, Author) commented Aug 24, 2021

Failures on CUDA look real:

torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(-0.4172, device='cuda:0')
analytical:tensor(-0.3553, device='cuda:0')

The above quantities relating the numerical and analytical jacobians are computed 
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background 
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:

Numerical:
 tensor([[ 0.0000],
        [ 0.0000],
        [-0.5000],
        [ 0.0000],
        [ 0.0000],
        [-0.5000]], device='cuda:0')
Analytical:
tensor([[ 0.0000],
        [ 0.0000],
        [-0.5000],
        [ 0.0000],
        [ 0.0000],
        [-0.5000]], device='cuda:0')

The max per-element difference (slow mode) is: 3.62396240234375e-05.
Fast gradcheck failed but element-wise differences are small. This means that the
test might've passed in slow_mode!

Failures also happen in slow mode, i.e. setting gradcheck_fast_mode=False.
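
For reference, here is a minimal sketch of how such a check can be reproduced directly with gradcheck in slow mode. The shapes and the lambda are illustrative assumptions, not the exact OpInfo sample inputs, and the test suite's gradcheck_fast_mode flag is assumed to map to gradcheck's fast_mode argument:

import torch
import torch.nn.functional as F
from torch.autograd import gradcheck

# gradcheck needs double precision; the integer target is not differentiated.
inp = torch.randn(2, 3, dtype=torch.double, device="cuda", requires_grad=True)
target = torch.randint(0, 3, (2,), device="cuda")

# fast_mode=False forces the full (slow) Jacobian comparison quoted above.
gradcheck(
    lambda inp: F.nll_loss(inp, target, reduction="mean"),
    (inp,),
    fast_mode=False,
)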

@zou3519 (Contributor) commented Aug 24, 2021

@pmeier what does the error look like with gradcheck_fast_mode=False?

@pmeier (Collaborator, Author) commented Aug 24, 2021

The same as the middle part of the message above:

torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor([[ 0.0000],
        [ 0.0000],
        [-0.4768],
        [ 0.0000],
        [ 0.0000],
        [-0.5364]], device='cuda:0')
analytical:tensor([[ 0.0000],
        [ 0.0000],
        [-0.5000],
        [ 0.0000],
        [ 0.0000],
        [-0.5000]], device='cuda:0')

Here, however, the differences are visible well before the fifth decimal place.

@heitorschueroff added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Aug 24, 2021
@pmeier (Collaborator, Author) commented Aug 25, 2021

I think the failures are due to non-determinism stemming from the reduction of the output to a scalar. With reduction="none" the tests pass without modification.
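
As a sanity check (not part of this PR), the suspected non-determinism can be probed by evaluating the reduced loss and its backward twice on identical inputs and comparing bit-for-bit; the shapes here are illustrative:

import torch
import torch.nn.functional as F

inp = torch.randn(32, 10, dtype=torch.double, device="cuda", requires_grad=True)
target = torch.randint(0, 10, (32,), device="cuda")

losses, grads = [], []
for _ in range(2):
    inp.grad = None  # reset the accumulated gradient between runs
    loss = F.nll_loss(inp, target, reduction="mean")
    loss.backward()
    losses.append(loss.detach().clone())
    grads.append(inp.grad.clone())

# Any False here points at a non-deterministic kernel in the reduction path.
print(torch.equal(losses[0], losses[1]), torch.equal(grads[0], grads[1]))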

Comment on lines 5154 to 5159
# (shape_2d, dict()),
# ((*shape_2d, 3, 3), dict()),
# (shape_2d, dict(weight=True)),
# (shape_2d, dict(ignore_index=1)),
# (shape_2d, dict(reduction="mean")),
# (shape_2d, dict(reduction="sum")),
@pmeier (Collaborator, Author) commented:

Enabling any of these sample inputs leads to gradcheck failures. They all have in common that a reduction is performed (reduction="mean" is the default value, and reduction="sum" reduces as well). reduction="none" uses a different code path and works fine. cc @albanD
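
If the non-determinism is confirmed, gradcheck's nondet_tol argument is one possible escape hatch; this is an option gradcheck documents for non-deterministic operations, not something this PR adopts, and the tolerance value below is illustrative:

import torch
import torch.nn.functional as F
from torch.autograd import gradcheck

inp = torch.randn(2, 3, dtype=torch.double, device="cuda", requires_grad=True)
target = torch.randint(0, 3, (2,), device="cuda")

# nondet_tol relaxes the requirement that repeated evaluations of the same
# input match exactly, tolerating small run-to-run differences.
gradcheck(
    lambda inp: F.nll_loss(inp, target, reduction="mean"),
    (inp,),
    nondet_tol=1e-5,
)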

@codecov (bot) commented Aug 26, 2021

Codecov Report

Merging #63854 (93bd050) into master (b1154cc) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #63854      +/-   ##
==========================================
- Coverage   66.85%   66.84%   -0.01%     
==========================================
  Files         695      695              
  Lines       90722    90748      +26     
==========================================
+ Hits        60649    60664      +15     
- Misses      30073    30084      +11     

@pmeier (Collaborator, Author) commented Aug 30, 2021

Closed in favor of #64203.

@pmeier closed this Aug 30, 2021
@pmeier deleted the opinfo/nll_loss branch August 30, 2021 18:37
facebook-github-bot pushed a commit that referenced this pull request Aug 30, 2021
Summary:
Fixes #64163

This PR includes the fix and the OpInfo from #63854 for non-regression testing.

cc albanD mruberry jbschlosser

Pull Request resolved: #64203

Reviewed By: albanD

Differential Revision: D30647522

Pulled By: jbschlosser

fbshipit-source-id: 2974d299763505908fa93532aca2bd5d5b71f2e9