
Conversation

@avmgithub
Contributor

As reported in Issue #10726, the JIT compiler, when running on ppc64le, may produce output that is isomorphic to the expected graph but still fails the diff test against the expected output file, which is created from a test run on x86_64. This change ensures that if the ppc64le test output differs, it is instead compared against an expected output file created when the test is run on a ppc64le system.

@avmgithub
Contributor Author

@pytorchbot retest this please

@ezyang previously requested changes Sep 14, 2018
Contributor

@ezyang left a comment

This is not really sustainable; most devs will not be able to keep the ppc files up to date. It would be better to figure out why results are different on ppc.

@avmgithub
Contributor Author

@ezyang no problem, I understand. Is there any way you can give us some hints on where this problem may be in the source? We had a similar problem before, reported in issue #5055, and it was fixed in #5124. That was all in C++ code; this one is in Python code.

@ezyang
Contributor

ezyang commented Sep 14, 2018

So, the first thing I would check is that the canonicalizer is working on subgraphs. There don't appear to be structural differences; and if that's the case, the graphs SHOULD number identically. It really looks like we're not canonicalizing subgraphs.

@avmgithub
Contributor Author

@ezyang are you referring to the Canonicalize function here:

std::shared_ptr<Graph> Canonicalize(const std::shared_ptr<Graph>& graph) {

? And when you refer to "not canonicalizing subgraphs", are you referring to:

r_node->g_(attr::Subgraph, Canonicalize(node->g(attr::Subgraph)));

?

It looks to me like they are fine. If they were not, wouldn't I also see the problem in the forward graph?

Just to add to the problem description: the problem only happens for the backward graph; the forward graph is fine.

For example, part of the backward graph output of test_milstm_fusion_cuda is below:

Output of ppc64le:
%22 : Float(*, *) = aten::mul(%8, %3)
%23 : Float(*, *) = aten::neg(%3)
%24 : int = prim::Constant[value=1]
%25 : Float(*, *) = aten::add(%23, %24, %24)
%26 : Float(*, *) = aten::mul(%22, %25)

Output of x86:
%22 : Float(*, *) = aten::neg(%3)
%23 : int = prim::Constant[value=1]
%24 : Float(*, *) = aten::add(%22, %23, %23)
%25 : Float(*, *) = aten::mul(%8, %3)
%26 : Float(*, *) = aten::mul(%25, %24)

So it seems the instructions are somewhat rearranged. Do you think this is due to canonicalization?
I'm thinking it has something to do with the backward functionality, but I may be totally wrong. I need your expert opinion and help on which source file I should be looking at.

@zou3519
Contributor

zou3519 commented Sep 18, 2018

@avmgithub without knowing very much yet about what is wrong, I would start from

ExecutionPlan compileSpec(const ArgumentSpec & spec) {

and print out the graph (std::cout << *graph << std::endl;) for a ppc64le arch and an x86 arch and see where it differs.
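
To make that concrete, here is a minimal sketch of where such a print could go; only the compileSpec signature and the print statement above come from the actual source, and the surrounding body is an assumption rather than a copy of the PyTorch graph executor:

// Sketch only: the body below is assumed, not copied from the PyTorch source.
ExecutionPlan compileSpec(const ArgumentSpec & spec) {
  // ... build and specialize the graph as the executor normally does ...
  std::cout << *graph << std::endl;  // dump the IR so the ppc64le and x86_64 runs can be diffed line by line
  // ... continue with optimization and code generation ...
}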

@avmgithub
Contributor Author

@zou3519, I've attached two files, one for x86 and one for ppc64le, from a test run like: python -m unittest -q test_jit.TestScript.test_milstm_fusion_cuda. There are differences in the sigmoid function. I'm assuming the graph is for the backward operation. Here is an example:

x86:
%56 : Dynamic = aten::neg(%outgate)
%57 : Dynamic = aten::add(%56, %28, %28)
%58 : Dynamic = aten::mul(%34, %outgate)
%59 : Dynamic = aten::mul(%58, %57)

ppc:
%56 : Dynamic = aten::mul(%34, %outgate)
%57 : Dynamic = aten::neg(%outgate)
%58 : Dynamic = aten::add(%57, %28, %28)
%59 : Dynamic = aten::mul(%56, %58)

There are three such instances, since there are three sigmoid functions in the MiLSTMCell function in test_jit.py.

@avmgithub
Contributor Author

avmgithub commented Sep 20, 2018

x86.txt
ppc.txt

Let me know which source file I can start looking at for differences. I appreciate the help.

@avmgithub
Contributor Author

avmgithub commented Sep 24, 2018

@zou3519 When you get a chance, do you know where (in which source file) the sigmoid backward graph is formed? That seems to be where the discrepancy is coming from.

example:

%70 : Dynamic = prim::GradOf[name="aten::sigmoid"]
  block0() {
    %71 : Dynamic = aten::mul(%49, %ingate)
    %72 : Dynamic = aten::neg(%ingate)
    %73 : Dynamic = aten::add(%72, %28, %28)
    %74 : Dynamic = aten::mul(%71, %73)
    -> (%74)
  }

@zou3519
Contributor

zou3519 commented Sep 24, 2018

it depends, but it is most likely coming from autodiff here:

} else if (node->matches("aten::sigmoid(Tensor self) -> Tensor")) {
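
Putting the fragments quoted in this thread together, that branch returns the gradient expression the next comment quotes; roughly (a sketch, not an exact copy of autodiff.cpp):

} else if (node->matches("aten::sigmoid(Tensor self) -> Tensor")) {
  // d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)), applied to the incoming gradient
  return {grads.at(0) * outputs.at(0) * (1 - outputs.at(0))};
}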

@avmgithub
Contributor Author

avmgithub commented Sep 24, 2018

@zou3519 Thanks for the tip. It looks like if I rearrange the line:

return {grads.at(0) * outputs.at(0) * (1 - outputs.at(0))};

to

return {((1 - outputs.at(0)) * outputs.at(0) * grads.at(0))};

and re-run python test/test_jit.py TestScript.test_milstm_fusion_cuda --accept, both expect files (forward and backward, ppc64le and x86_64) are now identical.

Can you please let me know if this is OK to do?

I have no idea why I have to rearrange it to get the same backward expect output.
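
As a sanity check that the reordering cannot change what is computed, here is a small standalone C++ snippet (not PyTorch code) evaluating the sigmoid backward formula with both orderings; the PR only changes the order in which the JIT records the graph nodes:

#include <cmath>
#include <cstdio>

int main() {
  double x = 0.3, grad_output = 1.7;         // arbitrary example values
  double out = 1.0 / (1.0 + std::exp(-x));   // forward: out = sigmoid(x)

  // Original ordering in autodiff:  grad * out * (1 - out)
  double grad_a = grad_output * out * (1.0 - out);
  // Reordered as in this PR:        (1 - out) * out * grad
  double grad_b = (1.0 - out) * out * grad_output;

  // Mathematically identical; at most an ulp apart due to the different
  // grouping of the floating-point multiplies.
  std::printf("%.17g\n%.17g\n", grad_a, grad_b);
  return 0;
}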

@avmgithub avmgithub changed the title Add ppc expect file comparison logic, fixes #10726 Fix for ppc64le jit graph difference in sigmoid backward, see #10726 Sep 25, 2018
@avmgithub
Contributor Author

@ezyang @zou3519 When you get the chance, please review the suggested changes.

@sdmonov
Contributor

sdmonov commented Sep 26, 2018

I tested the proposed solution on a ppc64le and it works for me.

Contributor

@zou3519 left a comment

It's not a complete fix, but it works for now and is simple enough. Please add a TODO/comment to the code to prevent regressions, and so that someone in the future will take a deeper look at it.
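
For illustration, such a note next to the reordered expression might look like the following; the wording is an assumption, not the comment that actually landed:

} else if (node->matches("aten::sigmoid(Tensor self) -> Tensor")) {
  // TODO: the factors are ordered this way so that the emitted backward graph
  // is identical on ppc64le and x86_64 (see #10726); the math is unchanged.
  // A deeper fix should make graph construction independent of this ordering.
  return {((1 - outputs.at(0)) * outputs.at(0) * grads.at(0))};
}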

@avmgithub
Contributor Author

@zou3519 Thanks for approving. Was there something else that needs to be done before this can be merged? It still says merging is blocked.

Contributor

@facebook-github-bot left a comment

soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
