-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Fix for ppc64le jit graph difference in sigmoid backward, see #10726 #11579
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@pytorchbot retest this please |
ezyang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not really sustainable; most devs will not be able to keep the ppc files up to date. It would be better to figure out why results are different on ppc.
|
So, the first thing I would check is that the canonicalizer is working on subgraphs. There don't appear to be structural differences; and if that's the case, the graphs SHOULD number identically. It really looks like we're not canonicalizing subgraphs. |
|
@ezyang are you referring to the Canonicalize function here
It looks to me they are fine. If they are not would I not see the problem also happen in the forward graph? Just to add to the problem description: The problem only happens for the backward graph . The forward graph is fine. So for example part of the output of the backward graph of the test_milstm_fusion_cuda is below: Output of ppc64le Output of x86 So it seems the instructions are kind of re-arranged. Do you think this is due to canonicalization ? |
|
@avmgithub from not knowing very much about what is wrong, I would start from pytorch/torch/csrc/jit/graph_executor.cpp Line 372 in e00fb69
std::cout << *graph << std::endl;) for a ppc64le arch and x86 arch and see where it differs
|
|
@zou3519 , I've attached 2 files , one for x86 and one ppc64le . From a test run like : python -m unittest -q test_jit.TestScript.test_milstm_fusion_cuda . There's differences in the sigmoid function. I'm assuming the graph is for the backward operation. Here is an example: x86 : ppc : There are 3 such instances since there are three sigmoid functions in the MiLSTMCell function in test_jit.py |
|
@zou3519 When you get the chance, do know where (which source file) the sigmoid backward graph is formed ? This seems to be where the discrepancy is coming from. example: %70 : Dynamic = prim::GradOfname="aten::sigmoid" |
|
it depends, but it is most likely coming from autodiff here: pytorch/torch/csrc/jit/autodiff.cpp Line 101 in 76ab26c
|
|
@zou3519 Thanks for the tip. It looks like if I re-arrange the line : And re-run the : python test/test_jit.py TestScript.test_milstm_fusion_cuda --accept Can you please let me know if this is OK to do. I have no idea why I have to re-arrange it to give me the same backward expect output. |
|
I tested the proposed solution on a ppc64le and it works for me. |
zou3519
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a complete fix, but it works for now and is simple enough. Please add a TODO/comment into the code to prevent regressions and so that someone in the future will take a deeper look at it
|
@zou3519 Thanks for approving, was there something else that needs to be done before this can be merged. It still says merging is blocked |
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
As reported in Issue #10726, the jit compiler, when running on ppc64le, may produce an isomorphic output but fail a diff test against the expected output file. The expected output file is created from a test that was ran on x86_64. This ensures that if ppc64le test output is different, the output is instead compared to an expected output file created when the test is run on a ppc64le system.