[pytorch][perf] add eigen blas for mobile build #26508
Conversation
Summary:
Enable BLAS for the PyTorch mobile build using Eigen BLAS.
It's not the most impactful optimization for typical mobile CV models, since we
are already using NNPACK/QNNPACK for most ops there, but it gives us a good
fallback implementation for the remaining ops.
Test Plan:
- Create a simple matrix multiplication script model:
```
import torch

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.weights = torch.ones(1000, 1000)

    def forward(self, x):
        return torch.mm(x, self.weights)

n = Net()
module = torch.jit.trace_module(n, {'forward': torch.ones(1000, 1000)})
module.save('mm.pk')
```
- Before integrating with Eigen BLAS:
```
adb shell 'cd /data/local/tmp; ./speed_benchmark_torch --model=mm.pk --input_dims="1000,1000" --input_type=float --warmup=5 --iter=5'
Milliseconds per iter: 2218.52. Iters per second: 0.450751
```
- After integrating with Eigen BLAS (roughly a 7x speedup):
```
adb shell 'cd /data/local/tmp; ./speed_benchmark_torch_eigen --model=mm.pk --input_dims="1000,1000" --input_type=float --warmup=5 --iter=5'
Milliseconds per iter: 314.535. Iters per second: 3.17929
```
- Improve MobileNetV2 single-thread perf by ~5%:
```
adb shell 'cd /data/local/tmp; ./speed_benchmark_torch --model=mobilenetv2.pk --input_dims="1,3,224,224" --input_type=float --warmup=5 --iter=20 --print_output=false --caffe2_threadpool_force_inline=true'
Milliseconds per iter: 367.055.
adb shell 'cd /data/local/tmp; ./speed_benchmark_torch_eigen --model=mobilenetv2.pk --input_dims="1,3,224,224" --input_type=float --warmup=5 --iter=20 --print_output=false --caffe2_threadpool_force_inline=true'
Milliseconds per iter: 348.77.
```
Differential Revision: [D17489587](https://our.internmc.facebook.com/intern/diff/D17489587)

Looks fine to me from the mobile side. Any concerns from the core team?

Looking into the Android CI failure: it attempted to enable Fortran but failed. It seems to be hitting a nasty issue in eigen/cmake/language_support.cmake: https://github.com/eigenteam/eigen-git-mirror/blob/d41dc4dd74acce21fb210e7625d5d135751fa9e5/cmake/language_support.cmake#L22 Eigen introduces a custom cmake function, "workaround_9220", to test language support. It first creates a dummy cmake file containing "enable_language(Fortran)", then runs cmake on it twice; if both runs succeed, it concludes that Fortran is supported on the host. For whatever reason, this ad-hoc language test generates a false positive for our Android setup: enable_language(Fortran) succeeds in the dummy test but fails fatally later on, when the main cmake actually calls enable_language(Fortran). This is probably because the ad-hoc test doesn't carry the Android NDK options into its separate cmake run. A more thorough fix needs to happen in eigen/cmake. One ugly workaround is to uninstall the Fortran compiler in our docker image. I'm also looking for an alternative approach to override the failure; suggestions are welcome.
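For reference, here is a condensed sketch of what that probe does, paraphrased from the linked language_support.cmake; it is abbreviated for illustration, not a verbatim copy. The key detail is that execute_process configures the dummy project in a fresh cmake invocation that does not forward the Android toolchain settings, so the probe effectively tests the host toolchain rather than the cross-compile target:
```
# Condensed paraphrase of Eigen's workaround_9220 (see the linked
# language_support.cmake); abbreviated for illustration, not a verbatim copy.
function(workaround_9220 language language_works)
  # Write a dummy project whose only job is to call enable_language().
  file(WRITE ${CMAKE_BINARY_DIR}/language_tests/${language}/CMakeLists.txt
    "project(test NONE)\nenable_language(${language})\n")
  # Configure the dummy project in a separate cmake run. No Android
  # toolchain file or NDK flags are forwarded here, so this effectively
  # probes the host toolchain, not the cross-compile target.
  # (The real function repeats this configure step twice before trusting
  # the result.)
  execute_process(
    COMMAND ${CMAKE_COMMAND} . -G "${CMAKE_GENERATOR}"
    WORKING_DIRECTORY ${CMAKE_BINARY_DIR}/language_tests/${language}
    RESULT_VARIABLE return_code
    OUTPUT_QUIET ERROR_QUIET)
  if(return_code EQUAL 0)
    set(${language_works} ON PARENT_SCOPE)
  else()
    set(${language_works} OFF PARENT_SCOPE)
  endif()
endfunction()
```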
I decided to create a new cmake file under cmake/External/EigenBLAS.cmake. It's simple enough and allows me to 1) work around the Fortran compiler test bug, and 2) make other cosmetic changes, such as creating and installing a static library instead of a dynamic one.
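For context, here is a minimal hypothetical sketch of what a cmake/External/EigenBLAS.cmake along those lines could look like; the variable names and source list below are assumptions for illustration, not the exact contents of this PR. The idea is to compile Eigen's BLAS sources directly as a static library, bypassing Eigen's own CMakeLists.txt and therefore its Fortran probe:
```
# Hypothetical sketch of cmake/External/EigenBLAS.cmake; paths, variable
# names, and the source list are assumptions, not the PR's exact contents.
if(__EIGEN_BLAS_INCLUDED)
  return()
endif()
set(__EIGEN_BLAS_INCLUDED TRUE)

# Point at the Eigen BLAS sources vendored under third_party.
set(EIGEN_BLAS_SRC_DIR "${PROJECT_SOURCE_DIR}/third_party/eigen/blas")

# Eigen implements the BLAS interface in a handful of C++ translation units.
set(EigenBlas_SRCS
  ${EIGEN_BLAS_SRC_DIR}/single.cpp
  ${EIGEN_BLAS_SRC_DIR}/double.cpp
  ${EIGEN_BLAS_SRC_DIR}/complex_single.cpp
  ${EIGEN_BLAS_SRC_DIR}/complex_double.cpp
  ${EIGEN_BLAS_SRC_DIR}/xerbla.cpp)

# Build and install a static library only; no shared library for mobile.
add_library(eigen_blas STATIC ${EigenBlas_SRCS})
install(TARGETS eigen_blas ARCHIVE DESTINATION lib)
```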
dzhulgakov
left a comment
Looks good! Now we just need to fix NNPACK + groups :)
cc @xuhdev, if you are interested
ezyang
left a comment
sure
This is good, thanks!
This pull request has been merged in d6e3aed.