
Conversation

@avmgithub
Contributor

As reported in Issue #10726, the JIT compiler, when running on ppc64le, may produce output that is isomorphic to the expected graph but still fails the diff test against the expected output file, which is created from a test run on x86_64. This change ensures that if the ppc64le test output differs, it is instead compared against an expected output file created when the test is run on a ppc64le system.

@avmgithub
Contributor Author

@pytorchbot retest this please

@ezyang previously requested changes Sep 14, 2018
Contributor

@ezyang left a comment

This is not really sustainable; most devs will not be able to keep the ppc files up to date. It would be better to figure out why results are different on ppc.

@avmgithub
Contributor Author

@ezyang no problem, I understand. Is there any way you can give us some hints on where this problem may be in the source? We had a similar problem before, reported in issue #5055, and it was fixed in #5124. That was all in C++ code; this one is in Python code.

@ezyang
Contributor

ezyang commented Sep 14, 2018

So, the first thing I would check is that the canonicalizer is working on subgraphs. There don't appear to be structural differences; and if that's the case, the graphs SHOULD number identically. It really looks like we're not canonicalizing subgraphs.

@avmgithub
Contributor Author

@ezyang are you referring to the Canonicalize function here:

std::shared_ptr<Graph> Canonicalize(const std::shared_ptr<Graph>& graph) {

? And when you refer to "not canonicalizing subgraphs", are you referring to:

r_node->g_(attr::Subgraph, Canonicalize(node->g(attr::Subgraph)));

?

It looks to me like they are fine. If they were not, wouldn't I also see the problem in the forward graph?

Just to add to the problem description: the problem only happens for the backward graph; the forward graph is fine.

For example, part of the backward graph output of test_milstm_fusion_cuda is below:

Output of ppc64le:
%22 : Float(*, *) = aten::mul(%8, %3)
%23 : Float(*, *) = aten::neg(%3)
%24 : int = prim::Constant[value=1]
%25 : Float(*, *) = aten::add(%23, %24, %24)
%26 : Float(*, *) = aten::mul(%22, %25)

Output of x86:
%22 : Float(*, *) = aten::neg(%3)
%23 : int = prim::Constant[value=1]
%24 : Float(*, *) = aten::add(%22, %23, %23)
%25 : Float(*, *) = aten::mul(%8, %3)
%26 : Float(*, *) = aten::mul(%25, %24)

So it seems the instructions are somewhat rearranged. Do you think this is due to canonicalization?
I'm thinking it has something to do with the backward functionality, but I may be totally wrong. I need your expert opinion and help on which source file I should be looking at.

@zou3519
Contributor

zou3519 commented Sep 18, 2018

@avmgithub without knowing very much yet about what is wrong, I would start from

ExecutionPlan compileSpec(const ArgumentSpec & spec) {

and print out the graph (std::cout << *graph << std::endl;) for a ppc64le arch and an x86 arch and see where it differs.
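
To make that concrete, here is a minimal sketch of where such a print could go; only the compileSpec signature and the print statement above come from the actual source, and the surrounding body is an assumption rather than a copy of the PyTorch graph executor:

// Sketch only: the body below is assumed, not copied from the PyTorch source.
ExecutionPlan compileSpec(const ArgumentSpec & spec) {
  // ... build and specialize the graph as the executor normally does ...
  std::cout << *graph << std::endl;  // dump the IR so the ppc64le and x86_64 runs can be diffed line by line
  // ... continue with optimization and code generation ...
}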

@avmgithub
Contributor Author

@zou3519, I've attached two files, one for x86 and one for ppc64le, from a test run like: python -m unittest -q test_jit.TestScript.test_milstm_fusion_cuda. There are differences in the sigmoid function. I'm assuming the graph is for the backward operation. Here is an example:

x86:
%56 : Dynamic = aten::neg(%outgate)
%57 : Dynamic = aten::add(%56, %28, %28)
%58 : Dynamic = aten::mul(%34, %outgate)
%59 : Dynamic = aten::mul(%58, %57)

ppc:
%56 : Dynamic = aten::mul(%34, %outgate)
%57 : Dynamic = aten::neg(%outgate)
%58 : Dynamic = aten::add(%57, %28, %28)
%59 : Dynamic = aten::mul(%56, %58)

There are three such instances, since there are three sigmoid functions in the MiLSTMCell function in test_jit.py.

@avmgithub
Contributor Author

avmgithub commented Sep 20, 2018

x86.txt
ppc.txt

Let me know which source file I can start looking at for differences. I appreciate the help.

@avmgithub
Contributor Author

avmgithub commented Sep 24, 2018

@zou3519 When you get a chance, do you know where (in which source file) the sigmoid backward graph is formed? That seems to be where the discrepancy is coming from.

example:

%70 : Dynamic = prim::GradOf[name="aten::sigmoid"]
  block0() {
    %71 : Dynamic = aten::mul(%49, %ingate)
    %72 : Dynamic = aten::neg(%ingate)
    %73 : Dynamic = aten::add(%72, %28, %28)
    %74 : Dynamic = aten::mul(%71, %73)
    -> (%74)
  }

@zou3519
Contributor

zou3519 commented Sep 24, 2018

it depends, but it is most likely coming from autodiff here:

} else if (node->matches("aten::sigmoid(Tensor self) -> Tensor")) {
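
Putting the fragments quoted in this thread together, that branch returns the gradient expression the next comment quotes; roughly (a sketch, not an exact copy of autodiff.cpp):

} else if (node->matches("aten::sigmoid(Tensor self) -> Tensor")) {
  // d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)), applied to the incoming gradient
  return {grads.at(0) * outputs.at(0) * (1 - outputs.at(0))};
}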

@avmgithub
Contributor Author

avmgithub commented Sep 24, 2018

@zou3519 Thanks for the tip. It looks like if I rearrange the line:

return {grads.at(0) * outputs.at(0) * (1 - outputs.at(0))};

to

return {((1 - outputs.at(0)) * outputs.at(0) * grads.at(0))};

and re-run python test/test_jit.py TestScript.test_milstm_fusion_cuda --accept, both expect files (forward and backward, ppc64le and x86_64) are now identical.

Can you please let me know if this is OK to do?

I have no idea why I have to rearrange it to get the same backward expect output.
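
As a sanity check that the reordering cannot change what is computed, here is a small standalone C++ snippet (not PyTorch code) evaluating the sigmoid backward formula with both orderings; the PR only changes the order in which the JIT records the graph nodes:

#include <cmath>
#include <cstdio>

int main() {
  double x = 0.3, grad_output = 1.7;         // arbitrary example values
  double out = 1.0 / (1.0 + std::exp(-x));   // forward: out = sigmoid(x)

  // Original ordering in autodiff:  grad * out * (1 - out)
  double grad_a = grad_output * out * (1.0 - out);
  // Reordered as in this PR:        (1 - out) * out * grad
  double grad_b = (1.0 - out) * out * grad_output;

  // Mathematically identical; at most an ulp apart due to the different
  // grouping of the floating-point multiplies.
  std::printf("%.17g\n%.17g\n", grad_a, grad_b);
  return 0;
}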

@avmgithub avmgithub changed the title Add ppc expect file comparison logic, fixes #10726 Fix for ppc64le jit graph difference in sigmoid backward, see #10726 Sep 25, 2018
@avmgithub
Contributor Author

@ezyang @zou3519 When you get the chance, please review the suggested changes.

@sdmonov
Contributor

sdmonov commented Sep 26, 2018

I tested the proposed solution on a ppc64le and it works for me.

Contributor

@zou3519 left a comment

It's not a complete fix, but it works for now and is simple enough. Please add a TODO/comment to the code to prevent regressions, and so that someone in the future will take a deeper look at it.
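
For illustration, such a note next to the reordered expression might look like the following; the wording is an assumption, not the comment that actually landed:

} else if (node->matches("aten::sigmoid(Tensor self) -> Tensor")) {
  // TODO: the factors are ordered this way so that the emitted backward graph
  // is identical on ppc64le and x86_64 (see #10726); the math is unchanged.
  // A deeper fix should make graph construction independent of this ordering.
  return {((1 - outputs.at(0)) * outputs.at(0) * grads.at(0))};
}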

@avmgithub
Contributor Author

@zou3519 Thanks for approving. Was there something else that needs to be done before this can be merged? It still says merging is blocked.

Contributor

@facebook-github-bot left a comment

soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
