Move legacy TH functions(sinh,cosh) to TensorIterator + Vec256 #21115
Conversation
```diff
 IMPLEMENT_UNARY_OP_VEC(ceil)
 IMPLEMENT_UNARY_OP_VEC(cos)
-IMPLEMENT_UNARY_OP_TH(cosh)
+IMPLEMENT_UNARY_OP_VEC(cosh)
```
Can we remove the `IMPLEMENT_UNARY_OP_TH` macro defined above?
Sounds good.
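For context, a rough sketch of the two macro families being discussed. This is a hypothetical simplification, not the actual definitions from aten/src/ATen/native/UnaryOps.cpp; the stub and TH function names here are illustrative:

```cpp
// Hypothetical simplification: the _TH variant forwards to the legacy TH
// backend, while the _VEC variant builds a TensorIterator and dispatches
// to a (potentially vectorized) CPU kernel stub.
#define IMPLEMENT_UNARY_OP_TH(op)                                 \
  Tensor& _##op##_out_cpu(Tensor& result, const Tensor& self) {   \
    return at::legacy::th::_th_##op##_out(result, self);          \
  }

#define IMPLEMENT_UNARY_OP_VEC(op)                                \
  Tensor& _##op##_out_cpu(Tensor& result, const Tensor& self) {   \
    auto iter = TensorIterator::unary_op(result, self);           \
    op##_stub(iter.device_type(), iter);                          \
    return result;                                                \
  }
```

Once no op uses the `_TH` form anymore, that macro (and the `_th_*` entries backing it) can indeed be deleted.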
And did you try removing the TH CPU backend for `_th_cosh` and `_th_sinh` from aten/src/ATen/Declarations.cwrap?
Please notice this comment and also this comment.
glibc's math library doesn't vectorize sinh and cosh, and the float specializations of Vec256's sinh and cosh don't use anything better than that.
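To make that remark concrete, a scalar-fallback specialization looks roughly like the following. This is an illustrative sketch, assuming a `map` helper (as in vec256_base) that applies a scalar function lane by lane, so the "vectorized" path is really just serial calls into libm:

```cpp
// Illustrative sketch: without a SLEEF/intrinsics implementation, the
// "vectorized" sinh is just eight sequential calls to libm's sinhf,
// one per float lane of the 256-bit register.
Vec256<float> sinh() const {
  return map(std::sinh);
}
```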
The SLEEF sinh operations are only guaranteed to 1 ULP within a limited range, whereas glibc guarantees 2 ULP over the full range. Looking at the function, this sort of doesn't make sense in the tails (since the function grows so large), and yet we still want to produce values that are standardized on something. We also don't want to produce 0s, which is what SLEEF would do. Now, VML is compliant with glibc as far as I know, but there are issues with clang as mentioned. sinh and cosh were disabled as part of this PR. It's true that I should have added more detail, but we have yet to do a full investigation into why this fails for clang, or at least disable it for clang. I'd say the better strategy is to
But the current PR likely mostly takes advantage of the improved memory read patterns. Take care that the transition penalty bug isn't happening, and see how other unary functions deal with it.
Yes, I've removed the CPU backend for both of them.
Yea, I switched the method from
facebook-github-bot left a comment
@xta0 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
While you are at it, can you also do the same thing for CUDA? It shouldn't be much more work, and it makes everything simpler.
```diff
 IMPLEMENT_UNARY_OP_VEC(ceil)
 IMPLEMENT_UNARY_OP_VEC(cos)
-IMPLEMENT_UNARY_OP_TH(cosh)
+IMPLEMENT_UNARY_OP_VEC(cosh)
```
Can we add a comment here noting that these kernels are not using Vec256 (because `_VEC` is confusing)?
VitalyFedyunin left a comment
Code looks good. The PR would become perfect if you remove the dead TH operator code.
Summary: This is a follow-up on James's PR: pytorch/pytorch#19041. The idea is to replace the legacy `sinh`/`cosh` ops that are being dispatched to TH with the operations defined in `Vec256` for better performance.

Benchmark (from James's script):

```python
import torch, time

ops = ['sinh', 'cosh']
x = torch.rand(1024, 1024)
NITER = 10000

print('op', 'time per iter (ms)', 'gops/s', 'GB/s', sep='\t')
for op in ops:
    s = time.time()
    for i in range(NITER):
        getattr(x, op)()
    elapsed_sec = ((time.time() - s) / NITER)
    print(op, elapsed_sec * 1000,
          (1024*1024/elapsed_sec)/1e9,
          (1024*1024*4*2) / elapsed_sec / 1e9, sep='\t')
```

Code on master:

```
op      time per iter (ms)      gops/s                  GB/s
sinh    3.37614369392395        0.3105839369002935      2.484671495202348
cosh    3.480502033233643       0.3012714803748572      2.4101718429988574
```

After the change (on MacBook Pro 2018):

```
op      time per iter (ms)      gops/s                  GB/s
sinh    0.8956503868103027      1.1707425301677301      9.365940241341841
cosh    0.9392147302627564      1.1164390487217428      8.931512389773943
```

Pull Request resolved: pytorch/pytorch#21115
Reviewed By: ljk53
Differential Revision: D15574580
Pulled By: xta0
fbshipit-source-id: 392546a0df11ed4f0945f2bc84bf5dea2750b60e
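The core of the change is the pattern named in the PR title: implement the CPU kernel with TensorIterator plus Vec256 instead of dispatching to TH. Below is a minimal sketch of that pattern, modeled on ATen's `cpu_kernel_vec` helper as it existed around this time; exact header paths and signatures may differ by version:

```cpp
#include <cmath>
#include <ATen/Dispatch.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cpu/Loops.h>
#include <ATen/cpu/vec256/vec256.h>

namespace at { namespace native {

// cpu_kernel_vec takes two lambdas: a scalar one (used for tail elements
// and non-contiguous data) and a Vec256 one (used for full vector lanes),
// so the loop is vectorized wherever the memory layout allows it.
static void sinh_kernel(TensorIterator& iter) {
  AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "sinh_cpu", [&] {
    cpu_kernel_vec(
        iter,
        [](scalar_t a) -> scalar_t { return std::sinh(a); },
        [](vec256::Vec256<scalar_t> a) { return a.sinh(); });
  });
}

}} // namespace at::native
```

The speedup in the benchmark above then comes from two sources: the vectorized `sinh`/`cosh` lanes and, as noted in the review discussion, TensorIterator's improved memory read patterns.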