
Conversation

@xta0 (Contributor) commented May 30, 2019

This is a follow-up to James's PR: #19041. The idea is to replace the legacy `sinh` / `cosh` ops, which are currently dispatched to TH, with the operations defined in `Vec256` for better performance.

Benchmark (from James's script):

```python
import torch, time
ops = ['sinh', 'cosh']
x = torch.rand(1024, 1024)
NITER = 10000

print('op', 'time per iter (ms)', 'gops/s', 'GB/s', sep='\t')
for op in ops:
    s = time.time()
    for i in range(NITER):
        getattr(x, op)()
    elapsed_sec = ((time.time() - s) / NITER)
    print(op, elapsed_sec * 1000, (1024*1024/elapsed_sec)/1e9, (1024*1024*4*2) / elapsed_sec / 1e9, sep='\t')
```

code on master:

```
op	time per iter (ms)	gops/s	GB/s
sinh	3.37614369392395	0.3105839369002935	2.484671495202348
cosh	3.480502033233643	0.3012714803748572	2.4101718429988574
```

after the change (on a MacBook Pro 2018):

```
op	time per iter (ms)	gops/s	GB/s
sinh	0.8956503868103027	1.1707425301677301	9.365940241341841
cosh	0.9392147302627564	1.1164390487217428	8.931512389773943
```
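
For context, here is a minimal sketch (not the exact diff in this PR) of how a Vec256-backed unary kernel is typically wired up in ATen, using the `cpu_kernel_vec` helper from `aten/src/ATen/native/cpu/Loops.h`. The function name, dispatch string, and include list are illustrative assumptions:

```cpp
#include <cmath>
#include <ATen/Dispatch.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cpu/Loops.h>
#include <ATen/cpu/vec256/vec256.h>

using namespace at;
using namespace at::vec256;

// Sketch of a vectorized sinh kernel: the first lambda is the scalar fallback
// for trailing elements, the second handles full SIMD lanes through Vec256
// (whose float/double specializations call into SLEEF).
static void sinh_kernel(TensorIterator& iter) {
  AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "sinh_cpu", [&]() {
    at::native::cpu_kernel_vec(
        iter,
        [](scalar_t a) -> scalar_t { return std::sinh(a); },
        [](Vec256<scalar_t> a) { return a.sinh(); });
  });
}
```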

@pytorchbot added the labels module: cpu (CPU specific problem, e.g., perf, algorithm) and module: operators on May 30, 2019
```cpp
IMPLEMENT_UNARY_OP_VEC(ceil)
IMPLEMENT_UNARY_OP_VEC(cos)
IMPLEMENT_UNARY_OP_TH(cosh)
IMPLEMENT_UNARY_OP_VEC(cosh)
```
Contributor commented:

Can we remove the IMPLEMENT_UNARY_OP_TH macro defined above?

xta0 (Contributor, Author) replied:

Sounds good.
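
For readers outside this file: the two macros differ in which backend the CPU out-variant calls into. The block below is a simplified sketch under that assumption, not ATen's literal macro definitions; once nothing expands `IMPLEMENT_UNARY_OP_TH`, the macro itself is dead code and can be removed.

```cpp
// Simplified sketch, not the literal ATen macros. The _VEC form routes the op
// through a TensorIterator plus a dispatched CPU stub (the kernel registered
// in UnaryOpsKernel.cpp, which may use Vec256), while the _TH form forwards
// to the legacy TH binding. The legacy _th_ call below is illustrative.
#define IMPLEMENT_UNARY_OP_VEC(op)                                   \
  Tensor& _##op##_out_cpu(Tensor& result, const Tensor& self) {      \
    auto iter = TensorIterator::unary_op(result, self);              \
    op##_stub(kCPU, iter); /* dispatches to the registered kernel */ \
    return result;                                                   \
  }

#define IMPLEMENT_UNARY_OP_TH(op)                                    \
  Tensor& _##op##_out_cpu(Tensor& result, const Tensor& self) {      \
    return legacy::th::_th_##op##_out(result, self);                 \
  }
```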

@ljk53 (Contributor) commented May 30, 2019

And did you try removing the TH CPU backend for _th_cosh & _th_sinh from aten/src/ATen/Declarations.cwrap?

@cpuhrsch (Contributor) commented:
Please notice this comment and also this comment

cc @VitalyFedyunin

@cpuhrsch requested a review from VitalyFedyunin on May 30, 2019 at 14:31
@cpuhrsch (Contributor) commented:
glibc's math library doesn't vectorize sinh and cosh, and the float specializations of Vec256's sinh and cosh don't use anything better than that.

@cpuhrsch (Contributor) commented:
The SLEEF sinh operations are accurate to 1 ULP only within a limited range, whereas glibc guarantees 2 ULP over the full range. Looking at the function, tight accuracy sort of doesn't make sense in the tails (since the function grows so large), and yet we still want to produce values that are standardized on something. We also don't want to produce zeros, which is what SLEEF would do.

Now, VML is compliant with glibc as far as I know, but there are issues with clang as mentioned. sinh and cosh were disabled as part of this PR. It's true that I should have added more detail, but we have yet to do a full investigation into why this fails for clang, or at least to disable it only for clang.

I'd say the better strategy is to

  1. Reenable VML for sinh and cosh via the performant macro used for the other unary functions and within the vml header (see the VML sketch below)
  2. Closely monitor clang's behavior and investigate any failures.

But the current PR likely mostly takes advantage of the improved memory read patterns. Take care that the transition penalty bug isn't happening, and see how other unary functions deal with it.
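
Regarding point 1 above, here is a minimal sketch of what the underlying VML calls look like, assuming an Intel MKL build. `vsSinh` / `vdSinh` are real MKL VML entry points, but the wrapper names and the way ATen's vml.h actually integrates them are assumptions for illustration:

```cpp
#include <cstdint>
#include <mkl.h>  // assumes an MKL build; provides vsSinh / vdSinh

// Whole-buffer hyperbolic sine over contiguous data. MKL picks a vectorized
// implementation internally; accuracy mode (HA/LA/EP) is set via vmlSetMode.
inline void vml_sinh(const float* in, float* out, int64_t n) {
  vsSinh(static_cast<MKL_INT>(n), in, out);
}

inline void vml_sinh(const double* in, double* out, int64_t n) {
  vdSinh(static_cast<MKL_INT>(n), in, out);
}
```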

@pytorchbot added the label module: internals (Related to internal abstractions in c10 and ATen) on May 30, 2019
@xta0 (Contributor, Author) commented May 30, 2019

> And did you try removing the TH CPU backend for _th_cosh & _th_sinh from aten/src/ATen/Declarations.cwrap?

Yes, I've removed the cpu backend for both of them.

@xta0 (Contributor, Author) commented May 30, 2019

> glibc's math library doesn't vectorize sinh and cosh, and the float specializations of Vec256's sinh and cosh don't use anything better than that.

Yeah, I switched the method from unary_kernel_vec to unary_kernel.
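
To make that distinction concrete, here is a minimal sketch of the scalar-only variant, assuming ATen's `cpu_kernel` helper from `native/cpu/Loops.h` (the `unary_kernel` / `unary_kernel_vec` wrappers mentioned above are assumed to sit on top of these helpers). It drops the Vec256 lambda and applies `std::sinh` element-wise, so the speedup over TH would come mainly from TensorIterator's memory traversal rather than from SIMD:

```cpp
#include <cmath>
#include <ATen/Dispatch.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cpu/Loops.h>

// Scalar-only kernel: no Vec256 lambda, just an element-wise std::sinh.
static void sinh_kernel_scalar(at::TensorIterator& iter) {
  AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "sinh_cpu", [&]() {
    at::native::cpu_kernel(iter, [](scalar_t a) -> scalar_t {
      return std::sinh(a);
    });
  });
}
```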

@facebook-github-bot (Contributor) left a comment:

@xta0 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@fmassa (Member) commented May 31, 2019

While you are at it, can you also do the same thing for CUDA? It shouldn't be much more work, and it makes everything simpler.
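
As a hedged sketch only (not part of this PR), the analogous CUDA port could be written against ATen's `gpu_kernel` helper from `native/cuda/Loops.cuh`; it would live in a .cu file compiled by nvcc, and the exact helper names available in 2019-era ATen may differ:

```cpp
#include <ATen/Dispatch.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cuda/Loops.cuh>

// Element-wise sinh on GPU via a device lambda handed to gpu_kernel.
void sinh_kernel_cuda(at::TensorIterator& iter) {
  AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "sinh_cuda", [&]() {
    at::native::gpu_kernel(iter, [] GPU_LAMBDA (scalar_t a) -> scalar_t {
      return ::sinh(a);
    });
  });
}
```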

```cpp
IMPLEMENT_UNARY_OP_VEC(ceil)
IMPLEMENT_UNARY_OP_VEC(cos)
IMPLEMENT_UNARY_OP_TH(cosh)
IMPLEMENT_UNARY_OP_VEC(cosh)
```
Contributor commented:

Can we add a comment here noting that these kernels are not using Vec256 (because _VEC is confusing)?

@VitalyFedyunin (Contributor) left a comment:

Code looks good. The PR would be perfect if you also removed the dead TH operator code.

@facebook-github-bot (Contributor) commented:

@xta0 merged this pull request in 052bab7.

zdevito pushed a commit to zdevito/ATen that referenced this pull request May 31, 2019
Pull Request resolved: pytorch/pytorch#21115

Reviewed By: ljk53

Differential Revision: D15574580

Pulled By: xta0

fbshipit-source-id: 392546a0df11ed4f0945f2bc84bf5dea2750b60e
@VitalyFedyunin added the label module: porting (Issues related to porting TH/THNN legacy to ATen native) on Jul 11, 2019