Vectorize non-contiguous unary operations #8488
Looks good, but the macOS build failure looks legit.
@pytorchbot retest this please
colesbury left a comment:
Worth a comment in the code about why cosh and sinh are disabled.
…ion and _mm256_sqrt_ps. Currently, for single-precision floating-point numbers, division and _mm256_sqrt_ps are used. While the current method is more accurate, using _mm256_rsqrt_ps is faster. See pytorch#8488 (comment). This commit switches to _mm256_rsqrt_ps because:
- It is more consistent with the CUDA implementation, which uses rsqrt, the fast but less accurate version: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#standard-functions
- Users who call torch.rsqrt() are more likely to want speed than accuracy; otherwise, they can use torch.sqrt().reciprocal().

Another possibility would be to add an option to decide which version to use, since rsqrt also has a faster derivative.
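To illustrate the speed/accuracy tradeoff this commit describes, here is a minimal Python sketch. It is not PyTorch's kernel and not `_mm256_rsqrt_ps` itself: it uses the classic bit-trick approximate inverse square root plus one Newton step as a stand-in for a fast, lower-precision rsqrt, and compares it against the exact `1/sqrt(x)`:

```python
import math
import struct

def approx_rsqrt(x: float) -> float:
    """Approximate 1/sqrt(x): cheap bit-level initial guess plus one Newton step."""
    # Reinterpret the float32 bits of x as an integer to form the initial guess.
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    # One Newton-Raphson refinement, as is common after a hardware rsqrt estimate.
    return y * (1.5 - 0.5 * x * y * y)

x = 2.0
exact = 1.0 / math.sqrt(x)
fast = approx_rsqrt(x)
rel_err = abs(fast - exact) / exact
print(f"exact={exact:.6f} fast={fast:.6f} rel_err={rel_err:.2e}")
```

Like the hardware estimate, the approximation trades a small relative error (well under 1%) for avoiding a full-precision divide and square root.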
Timings are on a single core. This branch increases performance on non-contiguous tensors and also includes the dlopen bugfix for glibc 2.23. I'm comparing against conda's Intel packages, which come with MKL VML. The benchmark is based on this commit of PyTorch's benchmark repo.
| Command | Timings for this branch | Timings for master |
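The kind of measurement behind this table can be reproduced with a small standard-library harness. The sketch below is illustrative only (the stride of 4 and the squaring op are arbitrary choices, not the benchmark's actual configuration); it times a unary operation over contiguous elements versus a strided subset, mimicking a non-contiguous view:

```python
import timeit
from array import array

n = 100_000
data = array('d', (float(i) for i in range(n)))

def contiguous():
    # Touch every element in order, as a contiguous tensor would be traversed.
    return sum(v * v for v in data)

def strided(step=4):
    # Touch every step-th element, mimicking a non-contiguous view.
    return sum(data[i] * data[i] for i in range(0, n, step))

t_contig = timeit.timeit(contiguous, number=10)
t_strided = timeit.timeit(strided, number=10)
print(f"contiguous: {t_contig:.4f}s  strided (1/4 of elements): {t_strided:.4f}s")
```

A real comparison would run both branches' binaries on the same pinned core with the same inputs, which is what the single-core setup above is for.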
The intel_pstate driver was used to turn off turbo mode and set the min and max performance to 100.
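On Linux, this is typically done through the intel_pstate sysfs knobs. A hedged sketch follows; it requires root, assumes the intel_pstate driver is active, and the exact procedure used for these timings is not recorded in the thread:

```python
# Hypothetical helper: pin CPU performance via the intel_pstate sysfs interface.
# Requires root and an active intel_pstate driver; do not run blindly.
PSTATE_DIR = "/sys/devices/system/cpu/intel_pstate"

def pin_cpu_frequency(pct: int = 100) -> None:
    settings = {
        "no_turbo": 1,        # disable turbo boost
        "min_perf_pct": pct,  # floor the performance state
        "max_perf_pct": pct,  # cap the performance state
    }
    for name, value in settings.items():
        with open(f"{PSTATE_DIR}/{name}", "w") as f:
            f.write(str(value))
```

Setting min_perf_pct equal to max_perf_pct pins the frequency, which keeps micro-benchmark timings stable across runs.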