SIMD: Use universal intrinsics to implement comparison functions#21483
SIMD: Use universal intrinsics to implement comparison functions#21483mattip merged 6 commits intonumpy:mainfrom
Conversation
|
@seiko2plus thoughts? This significantly speeds up integer comparisons |
685eeeb to
8da7100
Compare
|
Just noticed that I also have to add a test for array OP scalar (and scalar OP array). |
8da7100 to
0bb6b36
Compare
|
I already have the required tests. The PR is ready to be reviewed again. Thanks :) |
seiko2plus
left a comment
There was a problem hiding this comment.
Necessary improvements to the x86 architecture
0bb6b36 to
1134be2
Compare
02e9694 to
09b22a1
Compare
|
We have some errors in the CI but I am not sure if they are related to the latest changes. |
seiko2plus
left a comment
There was a problem hiding this comment.
Thank you, we're almost done here just few changes more.
|
See below the results updated: Benchmark/Performance:SSEAVX2AVX512FAVX512BWVSX3Binary size:
|
seiko2plus
left a comment
There was a problem hiding this comment.
Well done!, just one thing left.
| def time_less_than_binary(self, dtype): | ||
| (self.x < self.y) | ||
|
|
||
| def time_less_than_scalar1(self, dtype): | ||
| (self.s < self.x) | ||
|
|
||
| def time_less_than_scalar2(self, dtype): | ||
| (self.d < 1) | ||
| (self.x < self.s) |
There was a problem hiding this comment.
@mattip, benchmarks here doesn't cover the rest of operations, not sure if its necessary to cover them or its better to save a room for the upcoming benchmarks since its the benchmark process start to became pretty slow.
There was a problem hiding this comment.
+1 for minimal effective benchmarks. I don't think there is a need to repeatedly hit the same code multiple times
|
@rafaelcfsousa, its better to strip the debugging symbols before reporting the binary size. also I think we need a release note. |
On the one hand, there are hints in 'numpy/doc/release/upcoming_changes'. On the other, I see we are not very diligent about adding release notes to other PRs for SIMD performance. |
This commit also applies some techniques to reduce the size of the binary generated from the source loops_comparison.dispatch.c.src
This commit also rewrite the tests andc, orc and xnor
04f6803 to
dc4a9e3
Compare
PR numpy#21483 improves the execution time of the comparison functions by using universal intrinsics
The PR numpy#21483 improves the execution time of the comparison functions by using universal intrinsics
1efb0b8 to
1969089
Compare
The PR numpy#21483 improves the execution time of the comparison functions by using universal intrinsics
1969089 to
2aabbb8
Compare
The PR numpy#21483 improves the execution time of the comparison functions by using universal intrinsics
2aabbb8 to
2701a5a
Compare
@seiko2plus :
|
|
@seiko2plus and @mattip : |
|
Thanks @rafaelcfsousa and thanks @seiko2plus for the careful review |
The PR numpy#21483 improves the execution time of the comparison functions by using universal intrinsics
This PR moves the comparison functions (eq, ne, lt, le, ge and gt) to a new dispatchable source file to make use of the universal intrinsics. This optimization benefits all architectures.
The following universal intrinsics were added in this PR:
Benchmark:
X86
CPU
OS
Benchmark
SSE
AVX2
AVX512F
AVX512BW
Power little-endian (Power9/VSX3)
CPU
OS
Benchmark
VSX3
AArch64
As I do not have access to an ARM processor, I didn't execute the benchmark to check performance. But I was able to test the NEON code I added in this PR on a small ARM processor.
cc: @mattip and @seiko2plus