
Conversation

@xta0 xta0 commented Nov 19, 2019

Stack from ghstack:

Summary

The mobile build has been broken since last week due to a runtime error caused by a missing operator in JIT:

```shell
libc++abi.dylib: terminating with uncaught exception of type torch::jit::script::ErrorReport: 
Unknown builtin op: aten::_adaptive_avg_pool2d_backward.
Could not find any similar ops to aten::_adaptive_avg_pool2d_backward. This op may not exist or may not be currently supported in TorchScript.
:
at <string>:9:28
                grad_self = grad.expand(self.size()) / (self_size[-1] * self_size[-2])
            else:
                grad_self = torch._adaptive_avg_pool2d_backward(grad, self)
                            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE

            return grad_self
```

How this happens

Since we've disabled autograd for the open-source build, the backward ops aren't registered with the JIT.

When forward runs, a GraphExecutor is created according to the value of executor_mode. On mobile this flag was set to true, which gives us a ProfilingGraphExecutorImpl object. This executor apparently ends up trying to emit IR for the autograd schemas, which triggers the error above.

Fix

There are two ways to fix it.

  1. Add a macro that disables profiling_mode as well as executor_mode on mobile, similar to what the FBCODE_CAFFE2 build already does.
  2. Disable the two modes at runtime by calling torch::jit::getExecutorMode() = false; before calling forward (a sketch follows below).

(IMO, the second fix is more of a workaround, since it doesn't make much sense from a user's perspective (why should I need to do this?). The upside is that we don't have to introduce yet another macro.)
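
For illustration, a minimal sketch of the second option, assuming an app that loads a TorchScript model with `torch::jit::load`. The model path and input shape are made up, and depending on the PyTorch version `getExecutorMode()` may need an extra JIT header beyond `torch/script.h`:

```cpp
#include <torch/script.h>

int main() {
  // Option 2 (runtime workaround): fall back to the simple graph executor
  // so the profiling executor never tries to build autodiff graphs.
  // This must run before the first forward() call on the module.
  torch::jit::getExecutorMode() = false;

  // Hypothetical model and input, just to show where the call goes.
  auto module = torch::jit::load("model.pt");
  std::vector<torch::jit::IValue> inputs;
  inputs.emplace_back(torch::ones({1, 3, 224, 224}));
  auto output = module.forward(inputs).toTensor();
  return 0;
}
```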

Feel free to drop comments if there is a better way to fix it.

How this was not detected by our mobile CI

We're working on adding runtime tests to our mobile CI to catch issues like this.

Test Plan

  • The error above disappears
  • Don't break CI

cc @AshkanAliabadi

Differential Revision: D18605998

@xta0 xta0 requested a review from apaszke as a code owner November 19, 2019 04:49
xta0 added a commit that referenced this pull request Nov 19, 2019
ghstack-source-id: 06b4886
Pull Request resolved: #30067
@facebook-github-bot facebook-github-bot added the oncall: jit label (Add this issue/PR to JIT oncall triage queue) Nov 19, 2019
@xta0 xta0 requested a review from ljk53 November 19, 2019 05:14
@xta0 xta0 added the oncall: mobile label (Related to mobile support, including iOS and Android) Nov 19, 2019
### Summary

The mobile build on master has been broken for two weeks due to a runtime error:

```shell
libc++abi.dylib: terminating with uncaught exception of type torch::jit::script::ErrorReport: 
Unknown builtin op: aten::_adaptive_avg_pool2d_backward.
Could not find any similar ops to aten::_adaptive_avg_pool2d_backward. This op may not exist or may not be currently supported in TorchScript.
:
at <string>:9:28
                grad_self = grad.expand(self.size()) / (self_size[-1] * self_size[-2])
            else:
                grad_self = torch._adaptive_avg_pool2d_backward(grad, self)
                            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE

            return grad_self
```
Since we've disabled autograd for the open-source build, the `backward` ops aren't registered with the JIT.

When `forward` runs, `ProfilingGraphExecutorImpl::getPlanFor(Stack& stack)` gets called, which tries to invoke `differentiate` depending on the value of `needs_gradient`.

```cpp
 // TODO: insert grad propagation
  bool needs_gradient = getProfilingMode()
      ? needsGradientInProfilingMode(copy->block())
      : true;
  if (needs_gradient) {
    // for Simple Executor skip creating autodiff graphs
    // and let autograd handle backward for us
    if (getProfilingMode()) {
      auto diff_nodes = CreateAutodiffSubgraphs(
        copy,
        getAutodiffSubgraphInlining() ? autodiffSubgraphNodeThreshold : 1);
      for (Node *dnode : diff_nodes) {
        auto diff_graph = std::move(dnode->g(attr::Subgraph));
        Gradient gradient = differentiate(diff_graph);
        runOptimization(gradient.f);
        // run non diff optimization on the forward graph
        runNondiffOptimization(gradient.f);
        packGradient(gradient, dnode);
      }
      InlineAutodiffSubgraphs(copy, getAutodiffSubgraphInlining()
                                        ? autodiffSubgraphInlineThreshold
                                        : 1);
    }
  } else {
    runNondiffOptimization(copy);
  }
```

The fix is something of a workaround that gets rid of `profiling_mode`. Feel free to drop comments if there is a better way to fix it.
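
For context, a rough sketch of what the macro-based alternative (guarding the executor defaults with `C10_MOBILE`, the way `FBCODE_CAFFE2` is already special-cased) could look like. The flag names follow the `getExecutorMode()`/`getProfilingMode()` accessors, but the defaults and the exact location in `graph_executor.cpp` are assumptions for illustration, not the actual diff:

```cpp
#include <atomic>

// Illustrative defaults only: on mobile (and FBCODE) builds, default to the
// simple executor with profiling off, so autodiff graphs are never created.
// The non-mobile branch keeps whatever the tree already defines.
#if defined(FBCODE_CAFFE2) || defined(C10_MOBILE)
static std::atomic<bool> executor_mode{false};
static std::atomic<bool> profiling_mode{false};
#else
static std::atomic<bool> executor_mode{true};
static std::atomic<bool> profiling_mode{true};
#endif
```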

### Test Plan

- The error above disappears
- Don't break CI

cc @AshkanAliabadi
xta0 added a commit that referenced this pull request Nov 19, 2019
ghstack-source-id: 810d00c
Pull Request resolved: #30067
@xta0 xta0 changed the title from "[Mobile] Disable profiling_model for mobile" to "[Mobile] Disable ProfilingGraphExecutorImpl for mobile" Nov 19, 2019

ljk53 commented Nov 19, 2019

@zdevito - do you have a preference for how we should fix this? Is there any case where we need profiling enabled in the mobile build? If not, we can probably use the C10_MOBILE macro to turn it off by default. Autograd doesn't need this to work, correct? (We need autograd for federated learning on mobile.)

zdevito commented Nov 19, 2019

This fix is what we want for now. It was an oversight that this got enabled for mobile.

@xta0 merged this pull request in 2367e71.
