
Conversation

@apaszke (Contributor) commented Sep 13, 2018

  • Disable addmm fusion. The reason for this is explained in the comment.
  • Tiny change in stack.h that lets us avoid constructing an unnecessary temporary IValue on the (C++) stack (it will only get created on the interpreter stack directly).
  • Fixed a correctness issue in `requires_grad` propagation.
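For context, the addmm fusion disabled by this PR rewrites a matrix multiply followed by a bias add into a single call. A minimal NumPy sketch of the algebra involved (names and shapes are invented for illustration, not taken from the PyTorch source):

```python
import numpy as np

# Illustrative sketch: addmm computes beta * c + alpha * (a @ b) in one
# call, so a fusion pass can rewrite an mm followed by an add into one op.
def addmm(c, a, b, beta=1.0, alpha=1.0):
    return beta * c + alpha * (a @ b)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 5))
b = rng.standard_normal((5, 3))
c = rng.standard_normal((4, 3))

unfused = a @ b + c        # two ops: matmul, then elementwise add
fused = addmm(c, a, b)     # one fused call
assert np.allclose(unfused, fused)
```

The fusion is mathematically a no-op; the question the rest of this thread turns on is whether the fused form dispatches to faster or slower backend kernels.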

@pytorchbot pytorchbot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Sep 13, 2018
@facebook-github-bot (Contributor) commented:

apaszke has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@zdevito (Contributor) commented:

Still looks good.

facebook-github-bot pushed a commit to facebookresearch/ReAgent that referenced this pull request Sep 25, 2018
Summary:
- Disable addmm fusion. The reason for this is explained in the comment.
- Tiny change in `stack.h` that lets us avoid constructing an unnecessary temporary `IValue` on the (C++) stack (it will only get created on the interpreter stack directly).
- Fixed a correctness issue in `requires_grad` propagation
Pull Request resolved: pytorch/pytorch#11654

Reviewed By: colesbury

Differential Revision: D9813739

Pulled By: apaszke

fbshipit-source-id: 23e83bc8605802f39bfecf447efad9239b9421c3
@jmp84 (Contributor) commented Oct 10, 2018

@apaszke, this is causing a ~200ms latency regression for NMT models. Here are the top lines from perf before/after this change:

Before:

  • 43.31% NmtBenchmark NmtBenchmark [.] mkl_blas_avx2_sgemm_kernel_nocopy_NN_b0
  • 8.67% NmtBenchmark NmtBenchmark [.] loop_inner866
  • 4.42% NmtBenchmark libc-2.23.so [.] __memcpy_avx_unaligned
  • 3.20% NmtBenchmark NmtBenchmark [.] caffe2::math::(anonymous namespace)::ReduceTensor<float, std::plus >
  • 3.04% NmtBenchmark NmtBenchmark [.] caffe2::math::utils::GetIndexFromDims
  • 2.84% NmtBenchmark NmtBenchmark [.] mkl_blas_avx2_xsgemv_t
  • 2.84% NmtBenchmark NmtBenchmark [.] fbgemm::PackedGemmMatrix<signed char, int, true>::addr_
  • 2.76% CaffeTaskThread NmtBenchmark [.] mkl_blas_avx2_xsgemv_t

After:

  • 29.04% NmtBenchmark NmtBenchmark [.] mkl_blas_avx2_sgemm_kernel_nocopy_NN_b0
  • 17.97% NmtBenchmark NmtBenchmark [.] mkl_blas_avx2_sgemm_kernel_nocopy_TN_b1
  • 8.02% NmtBenchmark NmtBenchmark [.] mkl_blas_avx2_sgemm_kernel_nocopy_TN_b0
  • 5.95% CaffeTaskThread NmtBenchmark [.] mkl_blas_avx2_xsgemv_t
  • 4.28% NmtBenchmark libc-2.23.so [.] __memcpy_avx_unaligned

Is there more info I can provide to help debug this? Also cc @jamesr66a who is familiar with these models and how to optimize them.

@apaszke (Contributor, Author) commented Oct 10, 2018

@jmp84 can you please provide some example tensor sizes that appear as inputs to your GEMMs? It looks like we have started triggering the transposed kernels in some cases, which was not the case previously (TN vs NN).
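For readers parsing the kernel names in the profiles above: in MKL's `sgemm_..._XY_*` naming, the letters encode the transpose flags of the two GEMM operands (N = not transposed, T = transposed). A hedged NumPy sketch of why the same logical product can land on either path (shapes are invented; exactly which operand a framework marks as transposed depends on how it lowers the call):

```python
import numpy as np

# Illustrative only: NN kernels multiply both operands as stored, while
# TN kernels treat one operand as transposed. A Linear-style product
# y = x @ W.T, with the weight stored as (out_features, in_features),
# is the kind of call that lowers to a transposed-operand GEMM; copying
# the weight into transposed layout turns it into a plain NN product.
x = np.random.randn(8, 16).astype(np.float32)   # activations
W = np.random.randn(32, 16).astype(np.float32)  # (out_features, in_features)

y_t = x @ W.T                         # transposed-operand GEMM (TN-style)
y_n = x @ np.ascontiguousarray(W.T)   # same math on a non-transposed layout
assert y_t.shape == (8, 32)
assert np.allclose(y_t, y_n)
```

The two paths are numerically equivalent but can have very different memory-access patterns, which is consistent with the latency shift jmp84 reports.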

@apaszke (Contributor, Author) commented Oct 10, 2018

Finally, how do you run those models? Is that via ONNX export or what?

@jamesr66a (Collaborator) commented:

@apaszke I think what's going on here is that we are not hitting the special quantized implementation of FC in the caffe2 backend, because the ONNX-caffe2 backend emits "MatMul" + "Add" instead of "FC". This can be seen from loop_inner866 being present before but not after this patch (that label is from an autogenerated GEMM kernel). I'm going to send in a patch that adds addmm fusion on the ONNX export path, but not the regular path.
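The export-path fix described above amounts to a pattern-matching graph rewrite: find an Add consuming a MatMul's output and collapse the pair into one FC node. A hypothetical sketch (the node representation and helper name are invented, not the real exporter API, and it ignores the weight-layout details — caffe2's FC expects a transposed weight):

```python
# Hypothetical sketch of a MatMul+Add -> FC rewrite on an exported graph,
# so the backend can hit its specialized (e.g. quantized) FC kernel.
# Nodes are plain dicts here: {"op": ..., "in": [...], "out": ...}.
def fuse_matmul_add_to_fc(nodes):
    fused = []
    i = 0
    while i < len(nodes):
        n = nodes[i]
        nxt = nodes[i + 1] if i + 1 < len(nodes) else None
        if (n["op"] == "MatMul" and nxt is not None and nxt["op"] == "Add"
                and n["out"] in nxt["in"]):
            # The Add input that is not the MatMul result is the bias.
            bias = [t for t in nxt["in"] if t != n["out"]][0]
            fused.append({"op": "FC", "in": n["in"] + [bias],
                          "out": nxt["out"]})
            i += 2  # consume both matched nodes
        else:
            fused.append(n)
            i += 1
    return fused
```

A real pass would also have to check that the MatMul output has no other consumers before fusing, which this sketch omits.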

@dzhulgakov (Collaborator) commented:

So for the addmm optimization: does "not helpful at all" mean neutral or negative? If neutral, why not keep it, since it matters for ONNX export?

@apaszke (Contributor, Author) commented Oct 11, 2018

It means negative in some RNN use cases I saw.


Labels

oncall: jit, open source


8 participants