
Conversation


@ljk53 ljk53 commented Sep 19, 2019

Stack from ghstack:

Summary:
Enable BLAS for the PyTorch mobile build using Eigen BLAS.
It's not the juiciest optimization for typical mobile CV models, as we are
already using NNPACK/QNNPACK for most ops there, but it's nice to have a good
fallback implementation for other ops.

Test Plan:

- Create a simple matrix multiplication script model:
```
import torch

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.weights = torch.ones(1000, 1000)

    def forward(self, x):
        return torch.mm(x, self.weights)

n = Net()
module = torch.jit.trace_module(n, {'forward': torch.ones(1000, 1000)})
module.save('mm.pk')
```
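
A quick way to sanity-check the saved module locally before pushing it to the device (an optional step, not part of the original test plan): multiplying two 1000x1000 all-ones matrices should give every entry equal to 1000.
```
import torch

# Reload the module saved above and run it once on the host.
m = torch.jit.load('mm.pk')
out = m(torch.ones(1000, 1000))

# ones(1000, 1000) @ ones(1000, 1000): each entry is a sum of 1000 ones.
assert out.shape == (1000, 1000)
assert torch.allclose(out, torch.full((1000, 1000), 1000.0))
```
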
- Before integrating with Eigen BLAS:
```
adb shell 'cd /data/local/tmp; \
./speed_benchmark_torch \
--model=mm.pk \
--input_dims="1000,1000" \
--input_type=float \
--warmup=5 \
--iter=5'

Milliseconds per iter: 2218.52. Iters per second: 0.450751
```
- After integrating with Eigen BLAS:
```
adb shell 'cd /data/local/tmp; \
./speed_benchmark_torch_eigen \
--model=mm.pk \
--input_dims="1000,1000" \
--input_type=float \
--warmup=5 \
--iter=5'

Milliseconds per iter: 314.535. Iters per second: 3.17929
```
- Improves MobileNetV2 single-thread perf by ~5% (see the quick consistency check after this list):
```
adb shell 'cd /data/local/tmp; \
./speed_benchmark_torch \
--model=mobilenetv2.pk \
--input_dims="1,3,224,224" \
--input_type=float \
--warmup=5 \
--iter=20 \
--print_output=false \
--caffe2_threadpool_force_inline=true'

Milliseconds per iter: 367.055.

adb shell 'cd /data/local/tmp; \
./speed_benchmark_torch_eigen \
--model=mobilenetv2.pk \
--input_dims="1,3,224,224" \
--input_type=float \
--warmup=5 \
--iter=20 \
--print_output=false \
--caffe2_threadpool_force_inline=true'

Milliseconds per iter: 348.77.
```
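
A quick consistency check on the numbers above (plain arithmetic, not part of the original test plan): milliseconds per iter and iters per second are reciprocals, and the deltas work out to roughly a 7x matmul speedup and the ~5% MobileNetV2 win.
```
# Reported timings from the benchmark runs above.
mm_before_ms, mm_after_ms = 2218.52, 314.535
mnv2_before_ms, mnv2_after_ms = 367.055, 348.77

print(1000 / mm_before_ms)                 # ~0.4508 iters/sec (matches 0.450751)
print(1000 / mm_after_ms)                  # ~3.1793 iters/sec (matches 3.17929)
print(mm_before_ms / mm_after_ms)          # ~7.05x speedup on the 1000x1000 matmul
print(mnv2_before_ms / mnv2_after_ms - 1)  # ~0.052, i.e. the ~5% MobileNetV2 win
```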

Differential Revision: D17489587

@pytorchbot pytorchbot added the module: build label Sep 19, 2019
ljk53 added a commit that referenced this pull request Sep 19, 2019
ghstack-source-id: c479009
Pull Request resolved: #26508

dreiss commented Sep 19, 2019

Looks fine to me from the mobile side. Any concerns from the core team?


ljk53 commented Sep 20, 2019

Looking into the Android CI failure - it attempted to enable Fortran but failed. It seems to be hitting a nasty issue in eigen/cmake/language_support.cmake: https://github.com/eigenteam/eigen-git-mirror/blob/d41dc4dd74acce21fb210e7625d5d135751fa9e5/cmake/language_support.cmake#L22

Eigen introduces a custom cmake function "workaround_9220" to test language support. It first creates a dummy cmake file containing "enable_language(Fortran)", then runs cmake on it twice - if both runs succeed, it concludes that Fortran is supported on the host.

For whatever reason this ad-hoc language test generates a false positive for our Android setup - enable_language(Fortran) succeeds in the dummy test but fails fatally later when the main cmake run actually calls enable_language(Fortran), probably because the dummy test runs in a separate cmake invocation that doesn't carry the Android NDK options.

A more thorough fix needs to happen in eigen/cmake. One ugly workaround is to uninstall the Fortran compiler from our Docker image; I'm also looking for an alternative approach to override the failure. Suggestions are welcome.

ljk53 added a commit that referenced this pull request Sep 20, 2019
ghstack-source-id: d78af74
Pull Request resolved: #26508

ljk53 commented Sep 20, 2019

I decided to create a new cmake file under cmake/External/EigenBLAS.cmake. It's simple enough and lets me: 1) work around the Fortran compiler test bug; 2) make other cosmetic changes, like building and installing a static library instead of a dynamic one.

@dzhulgakov dzhulgakov left a comment

Looks good! Now we just need to fix NNPACK + groups :)


ezyang commented Sep 20, 2019

cc @xuhdev, if you are interested

@ezyang ezyang left a comment

sure


xuhdev commented Sep 20, 2019

This is good, thanks!

zdevito pushed a commit to zdevito/ATen that referenced this pull request Sep 21, 2019
Summary:
Pull Request resolved: pytorch/pytorch#26508

Differential Revision: D17489587

fbshipit-source-id: efe542db810a900f680da7ec7e60f215f58db66e
@facebook-github-bot

This pull request has been merged in d6e3aed.

mingbowan pushed a commit to mingbowan/pytorch that referenced this pull request Sep 23, 2019
Summary:
Pull Request resolved: pytorch#26508

Differential Revision: D17489587

fbshipit-source-id: efe542db810a900f680da7ec7e60f215f58db66e
@facebook-github-bot facebook-github-bot deleted the gh/ljk53/51/head branch October 28, 2019 22:16

Labels

Merged · module: android (Related to Android support) · module: build (Build system issues)
