SIMD: Optimize the performance of np.packbits in AVX2/AVX512F/VSX. #17102
mattip merged 29 commits into numpy:master
Conversation
Hang on, this is not a ufunc. How will it dispatch? We have to use the cpu-baseline flags for this function, which I think do not include AVX2.
@mattip The new dispatcher is already involved; as @Qiyu8 explained: 1. moved the old SIMD loop into a new dispatch-able source
Yes, the first step explains why AVX2 is enabled without the cpu-baseline flags: we defined a new baseline in front of the dispatch file.
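For context, a minimal sketch of what such a per-file baseline declaration can look like in a NumPy dispatch-able source. The file name is hypothetical and the target list is only an example, not this PR's actual configuration; the /*@targets ... */ configuration comment is how NumPy's build-time dispatcher enables extra targets for one translation unit on top of the global --cpu-baseline.

    /* pack.dispatch.c.src -- hypothetical file name.
     * The configuration comment below is parsed by NumPy's CPU dispatcher
     * and applies only to this file, on top of the baseline features.
     */
    /*@targets
     ** $maxopt baseline
     ** sse2 avx2 avx512f
     ** vsx2
     ** neon
     **/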
eric-wieser left a comment
See above comments about use of .h files
seiko2plus left a comment
Could you please replace it with npyv_tobits_b8 within #17789? It should perform better on aarch64.
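For illustration, a rough sketch of the operation npyv_tobits_b8 provides (the helper name pack_nonzero_bits and the header path are assumptions; the npyv_* calls are NumPy's universal intrinsics as used in the diff quoted below):

    #include "simd/simd.h"   /* NumPy's universal-intrinsics wrapper (path assumed) */

    /* Hypothetical helper: compare one vector of input bytes against zero and
     * collapse the boolean vector into a bitmask, one bit per byte lane --
     * exactly the primitive np.packbits needs. */
    static NPY_INLINE npy_uint64
    pack_nonzero_bits(const npy_uint8 *in)
    {
        npyv_u8 v    = npyv_load_u8(in);                    /* load npyv_nlanes_u8 bytes  */
        npyv_b8 mask = npyv_cmpneq_u8(v, npyv_zero_u8());   /* lane != 0 -> all-ones lane */
        return npyv_tobits_b8(mask);                        /* bit i = truth of lane i    */
    }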
@seiko2plus The new intrinsics really improved performance a lot; the running time was reduced by a staggering 93%. Here is the latest benchmark result (AVX2 enabled):
Does that increase make sense? Maybe it is somehow skipping the execution. Does the benchmark check that the result is correct?
@mattip AFAIK, the benchmark only checks the performance; correctness should be verified through a mass of test cases. A log was printed in the SIMD loop when running the test cases, so I'm sure the execution process is not skipped. Do you suggest importing a new benchmark library such as Google Benchmark instead of asv in order to get a more convincing result?
Thanks for the clarification. Is a 14x speed increase expected? My intuition, which is frequently wrong, is that if something is too good to be true it means there is a bug.
@mattip MSVC and bureaucracy are two sides of the same coin: they exist to kill performance :).
seiko2plus left a comment
Please, could you add a benchmark test for the little bit order?
numpy/benchmarks/benchmarks/bench_core.py
Line 165 in c9f9081
    bb[2] = npyv_tobits_b8(npyv_cmpneq_u8(v2, v_zero));
    bb[3] = npyv_tobits_b8(npyv_cmpneq_u8(v3, v_zero));
    if(out_stride == 1 &&
        (!NPY_STRONG_ALIGNMENT || npy_is_aligned(outptr, sizeof(npy_uint64)))) {
No need to re-evaluate npy_is_aligned(outptr, sizeof(npy_uint64)) on every iteration; just one call is needed before the loop.
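A hedged sketch of the suggestion, not the PR's exact code (the helper name store_packed_words is made up; outptr, out_stride and bb follow the quoted diff, and NumPy's common headers plus <string.h> are assumed):

    /* With out_stride == 1 the fast path advances outptr by whole npy_uint64
     * words, so alignment established before the loop cannot change inside it:
     * evaluate the predicate once, up front. */
    static NPY_INLINE void
    store_packed_words(char *outptr, npy_intp out_stride,
                       const npy_uint64 *bb, npy_intp nwords)
    {
        const int word_store_ok = (out_stride == 1) &&
            (!NPY_STRONG_ALIGNMENT || npy_is_aligned(outptr, sizeof(npy_uint64)));

        for (npy_intp n = 0; n < nwords; n++) {
            if (word_store_ok) {
                *(npy_uint64 *)outptr = bb[n];          /* contiguous, aligned store */
                outptr += sizeof(npy_uint64);
            }
            else {
                for (size_t j = 0; j < sizeof(npy_uint64); j++) {
                    memcpy(outptr, (const char *)&bb[n] + j, 1);
                    outptr += out_stride;
                }
            }
        }
    }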
    } else {
        for(int i = 0; i < 4; i++) {
            for (int j = 0; j < vstep; j++) {
                memcpy(outptr, (char*)&bb[i] + j, 1);
I know the compiler is going to optimize it, but why do we need memcpy for storing one byte?
We have npyv_storen_till_u32 now; do you suggest adding npyv_storen_till_u8 for a one-byte non-contiguous partial store? IMO, small-size memcpy optimization can be treated as a special optimization in follow-up PRs; this function is called in many places.
Non-contiguous/partial memory access intrinsics for u8, s8, u16, s16 will be useful for other SIMD kernels, but not this one. I just suggested storing each byte via dereferencing the output pointer.
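Concretely, the suggested change would look roughly like this (a sketch mirroring the quoted fragment; it assumes, as in the original loop, that outptr advances by out_stride after each byte):

    } else {
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < vstep; j++) {
                /* write the byte directly instead of memcpy-ing it */
                *outptr = ((const char *)&bb[i])[j];
                outptr += out_stride;
            }
        }
    }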
@mattip I fully understand why you are surprised about the performance improvement, but after multiple log tests I believe that the result is correct. Here is the latest benchmark result (for both little and big order):
Cool. Looks like a clear win.
Thanks @Qiyu8
    #endif
    #if !defined(NPY_STRONG_ALIGNMENT)
    #define NPY_STRONG_ALIGNMENT 0
    #endif
I just noticed this addition. Is this the same idea as NPY_CPU_HAVE_UNALIGNED_ACCESS, which we already use in a handful of places? I am a bit worried about duplicating this logic, especially since this seems to assume that all CPUs have unaligned access, rather than the opposite of assuming that only x86 and amd64 always support unaligned access.
Thanks for pointing this out. In this case only armv7 needs aligned access, which assumes that all other CPUs support unaligned access, so NPY_CPU_HAVE_UNALIGNED_ACCESS and NPY_STRONG_ALIGNMENT are like two sides of a coin. You are right about the duplicated logic; I will integrate them in a follow-up PR.
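To make the overlap concrete, a follow-up could derive one macro from the other, roughly along these lines (a sketch only; whether NPY_CPU_HAVE_UNALIGNED_ACCESS is defined as 0/1 or merely defined/undefined would need to be checked against the existing header):

    /* Sketch: express NPY_STRONG_ALIGNMENT as the inverse of the existing
     * unaligned-access flag instead of maintaining two independent lists
     * of architectures. */
    #ifndef NPY_STRONG_ALIGNMENT
        #if NPY_CPU_HAVE_UNALIGNED_ACCESS
            #define NPY_STRONG_ALIGNMENT 0   /* CPU tolerates unaligned loads/stores */
        #else
            #define NPY_STRONG_ALIGNMENT 1   /* e.g. armv7: require aligned access */
        #endif
    #endif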
np.packbits has already been optimized with intrinsics on SSE & NEON; it can easily be extended to AVX2 by using universal intrinsics. Here are the benchmark results:
With /arch and O2, the performance on SSE/NEON platforms is not significantly changed. Here is the size change of
_multiarray_umath.cp37-win_amd64.pyd