
Conversation

Collaborator

@jamesr66a commented Apr 8, 2019

I've been messing around with vectorizing the fusion compiler in JIT, and noticed that these ops were pathologically slow. I moved them to use TensorIterator + Vec256<> and got some speed wins.

Benchmark script:

import torch, time

ops = ['abs', 'neg', 'reciprocal', 'frac']

x = torch.rand(1024, 1024)  # 1M float32 elements
NITER = 10000

print('op', 'time per iter (ms)', 'gops/s', 'GB/s', sep='\t')

for op in ops:
    s = time.time()
    for i in range(NITER):
        getattr(x, op)()
    elapsed_sec = ((time.time() - s) / NITER)
    # gops/s: 1M elements per iter; GB/s: 4 bytes per float, x2 for read + write
    print(op, elapsed_sec * 1000, (1024*1024/elapsed_sec)/1e9, (1024*1024*4*2) / elapsed_sec / 1e9, sep='\t')

Before this change (on my Mac with a Skylake):

op      time per iter (ms)      gops/s  GB/s
abs     0.9730974197387695      1.0775652866097343      8.620522292877874
neg     1.0723679780960083      0.9778136063534356      7.822508850827485
reciprocal      1.2610594034194946      0.8315040490215421      6.6520323921723366
frac    1.1681334018707275      0.8976509004200546      7.181207203360437

After this change:

op      time per iter (ms)      gops/s  GB/s
abs     0.5031076192855835      2.084198210889721       16.673585687117768
neg     0.4433974027633667      2.3648672578256087      18.91893806260487
reciprocal      0.47145988941192624     2.2241043693195985      17.79283495455679
frac    0.5036592721939087      2.0819154096627024      16.65532327730162

So, after this change it looks like we are hitting machine peak for bandwidth and are bandwidth bound.
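For context, the kernels here follow the TensorIterator + Vec256 pattern. Below is a minimal sketch of what such a kernel looks like, assuming the `cpu_kernel_vec` helper from `ATen/native/cpu/Loops.h`; names and signatures are illustrative rather than an exact copy of this diff:

```cpp
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cpu/Loops.h>
#include <ATen/cpu/vec256/vec256.h>

namespace at { namespace native {

// cpu_kernel_vec takes a scalar fallback lambda plus a Vec256 lambda;
// the loop machinery handles strides, alignment, and the scalar tail.
static void neg_kernel(TensorIterator& iter) {
  AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "neg_cpu", [&]() {
    cpu_kernel_vec(
        iter,
        [](scalar_t a) -> scalar_t { return -a; },
        [](Vec256<scalar_t> a) { return a.neg(); });
  });
}

}} // namespace at::native
```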

Tensor & fill_(const Tensor & value);
Tensor floor() const;
Tensor & floor_();
Tensor frac() const;
Contributor

Why are these moving around?


@cpuhrsch
Contributor

cpuhrsch commented Apr 8, 2019

Did you look at Declarations.cwrap to see if you can remove some legacy functions or restrict them to CUDA, etc.?

return result;
}

// ***** abs *****
Contributor

I think you can make use of some of the macros below, no?

If not, I think it'd be cleaner to add one; there seems to be a lot of commonality between these functions.
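To make that concrete, a boilerplate-reducing macro might look like the sketch below (the macro name and exact shape are hypothetical, not ATen's actual macros):

```cpp
// Hypothetical macro (illustrative): stamps out a TensorIterator-based
// unary kernel from a scalar expression and a Vec256 expression, both
// written in terms of an argument named `a`.
#define IMPLEMENT_UNARY_KERNEL_VEC(op, scalar_expr, vec_expr)      \
  static void op##_kernel(TensorIterator& iter) {                  \
    AT_DISPATCH_FLOATING_TYPES(iter.dtype(), #op "_cpu", [&]() {   \
      cpu_kernel_vec(                                              \
          iter,                                                    \
          [](scalar_t a) -> scalar_t { return scalar_expr; },      \
          [](Vec256<scalar_t> a) { return vec_expr; });            \
    });                                                            \
  }

// Example: IMPLEMENT_UNARY_KERNEL_VEC(frac, a - std::trunc(a), a - a.trunc())
```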

iter,
[=](scalar_t a) -> scalar_t { return a - std::trunc(a); },
[=](Vec256<scalar_t> a) {
return a - a.trunc();
Contributor

It's better to add this to vec256 as a `.frac()` function in case we find a better way of doing this down the road. Same for neg below. Then you can also use the macros to get rid of the boilerplate.
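For example, a default `frac()` on the base vector class could just compose existing ops (a sketch mirroring the lambda above; the real base class may structure this differently):

```cpp
// Sketch for vec256_base.h, inside the Vec256<T> class template:
// frac(x) = x - trunc(x), built from ops the base class already has,
// so an ISA-specific specialization can override it later if a faster
// sequence turns up.
Vec256<T> frac() const {
  return *this - this->trunc();
}
```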

#include <ATen/Parallel.h>
#include <ATen/native/UnaryOps.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/cpu/vec256/vec256_base.h>
Contributor

Why is this one needed?

Collaborator Author

This is left over from the original implementation. Can delete.

@cpuhrsch
Contributor

cpuhrsch commented Apr 9, 2019

Could you revisit the relevant tests for this and check for dtype coverage and large input tensors? This has been an issue in the past.


// Negation. Defined here so we can utilize operator-

Vec256<int64_t> Vec256<int64_t>::neg() const {
Contributor

There is an XOR instruction for integers as well (AVX2+), which could aid further optimization.
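For reference, a sketch with raw AVX2 intrinsics of the subtract-based negation and an XOR-based alternative (illustrative only; which wins depends on port pressure and the surrounding code):

```cpp
#include <immintrin.h>

// Baseline: -x computed as 0 - x with a single subtract.
inline __m256i neg_epi64_sub(__m256i x) {
  return _mm256_sub_epi64(_mm256_setzero_si256(), x);
}

// XOR variant: two's complement gives -x == (x ^ -1) + 1.
// Since t - (-1) == t + 1, the final add is expressed as a subtract of -1.
inline __m256i neg_epi64_xor(__m256i x) {
  const __m256i all_ones = _mm256_set1_epi64x(-1);
  return _mm256_sub_epi64(_mm256_xor_si256(x, all_ones), all_ones);
}
```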

('cosh', (S, S, S), NO_ARGS, '', (True,)),
('cosh', (), NO_ARGS, 'scalar', (True,)),
('abs', (S, S, S), NO_ARGS, '', (True,)),
('abs', (L, L, L), NO_ARGS, '', (True,)),
Contributor

Wait, are these being used for benchmarking? These tests are really slow when run under test_autograd, and they don't help correctness.

Collaborator Author

I can change them back. I'm just trying to make them large enough to ensure they go down the vector path.

Contributor

Well, we should definitely have some tests for the vector path.

Contributor

We can just make it a slow test (and maybe only the forward?).

Collaborator Author

My decision here is to leave common_method_invocations as it is on master and rely on the test_torch tests to cover the functionality of this PR.
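A minimal sketch of the kind of check this relies on (hypothetical, not the actual test_torch code): compare each op against a reference on an input large enough to take the Vec256 path, with an odd length so the scalar tail is covered too, across dtypes.

```python
import torch

def check_unary_vec_path(op, ref, dtypes=(torch.float32, torch.float64)):
    for dtype in dtypes:
        # 1M+3 elements: big enough for the vectorized main loop, and odd
        # so the scalar tail path is exercised as well.
        x = torch.randn(1024 * 1024 + 3, dtype=dtype)
        torch.testing.assert_allclose(getattr(x, op)(), ref(x))

check_unary_vec_path('neg', lambda t: -t)
check_unary_vec_path('frac', lambda t: t - t.trunc())
check_unary_vec_path('reciprocal', lambda t: 1.0 / t)
check_unary_vec_path('abs', lambda t: torch.where(t < 0, -t, t))
```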

types:
- floating_point
backends:
- CPU
Contributor

Why can't you kill the entire entry instead of just the CPU one? How are these called?

Collaborator Author

They're called from the stubs in CUDAUnaryOps.cpp

Collaborator Author

@jamesr66a Apr 9, 2019

Actually, it turns out they're not, so I'm going to delete these.

@ezyang added the oncall: jit and module: vectorization labels on Apr 9, 2019
Contributor

@facebook-github-bot left a comment

@jamesr66a is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Apr 10, 2019
Summary: same as the PR description above.

Pull Request resolved: pytorch/pytorch#19041

Differential Revision: D14862037

Pulled By: jamesr66a

fbshipit-source-id: e2032ac0ca962dbf4120bb36812277c260e22912
@facebook-github-bot
Contributor

@jamesr66a merged this pull request in 82b5705.

zhangguanheng66 pushed a commit to zhangguanheng66/pytorch that referenced this pull request May 6, 2019
facebook-github-bot pushed a commit that referenced this pull request May 31, 2019
Summary:
This is a follow-up to James's PR: #19041. The idea is to replace the legacy `sinh` / `cosh` ops, which are dispatched to TH, with the operations defined in `Vec256` for better performance.

Benchmark (from James's script):

```python
import torch, time
ops = ['sinh', 'cosh']
x = torch.rand(1024, 1024)
NITER = 10000

print('op', 'time per iter (ms)', 'gops/s', 'GB/s', sep='\t')
for op in ops:
    s = time.time()
    for i in range(NITER):
        getattr(x, op)()
    elapsed_sec = ((time.time() - s) / NITER)
    print(op, elapsed_sec * 1000, (1024*1024/elapsed_sec)/1e9, (1024*1024*4*2) / elapsed_sec / 1e9, sep='\t')
```
Code on master:

```
op	time per iter (ms)	gops/s	GB/s
sinh	3.37614369392395	0.3105839369002935	2.484671495202348
cosh	3.480502033233643	0.3012714803748572	2.4101718429988574
```
After the change (on a 2018 MacBook Pro):

```
op	time per iter (ms)	gops/s	GB/s
sinh	0.8956503868103027	1.1707425301677301	9.365940241341841
cosh	0.9392147302627564	1.1164390487217428	8.931512389773943
```
Pull Request resolved: #21115

Reviewed By: ljk53

Differential Revision: D15574580

Pulled By: xta0

fbshipit-source-id: 392546a0df11ed4f0945f2bc84bf5dea2750b60e
zdevito pushed a commit to zdevito/ATen that referenced this pull request May 31, 2019