
Conversation


@ljk53 ljk53 commented Sep 19, 2019

Stack from ghstack:

Summary:
Enable BLAS for the PyTorch mobile build using Eigen BLAS.
It's not the juiciest optimization for typical mobile CV models, as we are
already using NNPACK/QNNPACK for most ops there, but it's nice to have a good
fallback implementation for other ops.

Test Plan:

- Create a simple matrix multiplication script model:
```
import torch

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.weights = torch.ones(1000, 1000)

    def forward(self, x):
        return torch.mm(x, self.weights)

n = Net()
module = torch.jit.trace_module(n, {'forward': torch.ones(1000, 1000)})
module.save('mm.pk')
```
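
A quick way to sanity-check the saved module locally before pushing it to the device (an optional step, not part of the original test plan): multiplying two 1000x1000 all-ones matrices should give every entry equal to 1000.
```
import torch

# Reload the module saved above and run it once on the host.
m = torch.jit.load('mm.pk')
out = m(torch.ones(1000, 1000))

# ones(1000, 1000) @ ones(1000, 1000): each entry is a sum of 1000 ones.
assert out.shape == (1000, 1000)
assert torch.allclose(out, torch.full((1000, 1000), 1000.0))
```
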
- Before integrating with Eigen BLAS:
```
adb shell 'cd /data/local/tmp; \
./speed_benchmark_torch \
--model=mm.pk \
--input_dims="1000,1000" \
--input_type=float \
--warmup=5 \
--iter=5'

Milliseconds per iter: 2218.52. Iters per second: 0.450751
```
- After integrating with Eigen BLAS:
```
adb shell 'cd /data/local/tmp; \
./speed_benchmark_torch_eigen \
--model=mm.pk \
--input_dims="1000,1000" \
--input_type=float \
--warmup=5 \
--iter=5'

Milliseconds per iter: 314.535. Iters per second: 3.17929
```
- Improves MobileNetV2 single-thread perf by ~5% (see the quick consistency check after this list):
```
adb shell 'cd /data/local/tmp; \
./speed_benchmark_torch \
--model=mobilenetv2.pk \
--input_dims="1,3,224,224" \
--input_type=float \
--warmup=5 \
--iter=20 \
--print_output=false \
--caffe2_threadpool_force_inline=true'

Milliseconds per iter: 367.055.

adb shell 'cd /data/local/tmp; \
./speed_benchmark_torch_eigen \
--model=mobilenetv2.pk \
--input_dims="1,3,224,224" \
--input_type=float \
--warmup=5 \
--iter=20 \
--print_output=false \
--caffe2_threadpool_force_inline=true'

Milliseconds per iter: 348.77.
```
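
A quick consistency check on the numbers above (plain arithmetic, not part of the original test plan): milliseconds per iter and iters per second are reciprocals, and the deltas work out to roughly a 7x matmul speedup and the ~5% MobileNetV2 win.
```
# Reported timings from the benchmark runs above.
mm_before_ms, mm_after_ms = 2218.52, 314.535
mnv2_before_ms, mnv2_after_ms = 367.055, 348.77

print(1000 / mm_before_ms)                 # ~0.4508 iters/sec (matches 0.450751)
print(1000 / mm_after_ms)                  # ~3.1793 iters/sec (matches 3.17929)
print(mm_before_ms / mm_after_ms)          # ~7.05x speedup on the 1000x1000 matmul
print(mnv2_before_ms / mnv2_after_ms - 1)  # ~0.052, i.e. the ~5% MobileNetV2 win
```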

Differential Revision: D17489587

@pytorchbot pytorchbot added the module: build label Sep 19, 2019
ljk53 added a commit that referenced this pull request Sep 19, 2019
ghstack-source-id: c479009
Pull Request resolved: #26508

dreiss commented Sep 19, 2019

Looks fine to me from the mobile side. Any concerns from the core team?


ljk53 commented Sep 20, 2019

Looking into the Android CI failure - it attempted to enable Fortran but failed. It seems to be hitting a nasty issue in eigen/cmake/language_support.cmake: https://github.com/eigenteam/eigen-git-mirror/blob/d41dc4dd74acce21fb210e7625d5d135751fa9e5/cmake/language_support.cmake#L22

Eigen introduces a custom cmake function "workaround_9220" to test language support. It first creates a dummy cmake file containing "enable_language(Fortran)", then runs cmake on it twice - if both runs succeed, it concludes that Fortran is supported on the host.

For whatever reason this ad-hoc language test generates a false positive for our Android setup - enable_language(Fortran) succeeds in the dummy test but fails fatally later when the main cmake run actually calls enable_language(Fortran), probably because the dummy test runs in a separate cmake invocation that doesn't carry the Android NDK options.

A more thorough fix needs to happen in eigen/cmake. One ugly workaround is to uninstall the Fortran compiler from our Docker image; I'm also looking for an alternative approach to override the failure. Suggestions are welcome.

ljk53 added a commit that referenced this pull request Sep 20, 2019
ghstack-source-id: d78af74
Pull Request resolved: #26508

ljk53 commented Sep 20, 2019

I decided to create a new cmake file under cmake/External/EigenBLAS.cmake. It's simple enough and lets me: 1) work around the Fortran compiler test bug; 2) make other cosmetic changes, like building and installing a static library instead of a dynamic one.

@dzhulgakov dzhulgakov left a comment

Looks good! Now we just need to fix NNPACK + groups :)


ezyang commented Sep 20, 2019

cc @xuhdev, if you are interested

@ezyang ezyang left a comment

sure


xuhdev commented Sep 20, 2019

This is good, thanks!

zdevito pushed a commit to zdevito/ATen that referenced this pull request Sep 21, 2019
Summary:
Pull Request resolved: pytorch/pytorch#26508

Differential Revision: D17489587

fbshipit-source-id: efe542db810a900f680da7ec7e60f215f58db66e
@facebook-github-bot

This pull request has been merged in d6e3aed.

mingbowan pushed a commit to mingbowan/pytorch that referenced this pull request Sep 23, 2019
Summary:
Pull Request resolved: pytorch#26508

Differential Revision: D17489587

fbshipit-source-id: efe542db810a900f680da7ec7e60f215f58db66e
@facebook-github-bot facebook-github-bot deleted the gh/ljk53/51/head branch October 28, 2019 22:16

Labels

Merged · module: android (Related to Android support) · module: build (Build system issues)
