SIMD: Optimize the performance of np.packbits in AVX2/AVX512F/VSX. #17102
mattip merged 29 commits into numpy:master
Conversation
Hang on, this is not a ufunc. How will it dispatch? We have to use the cpu-baseline flags for this function, which I think do not include AVX2.
@mattip The new dispatcher is already involved; as @Qiyu8 explained: 1. moved the old SIMD loop into a new dispatch-able source
Yes, the first step explains why AVX2 is enabled without the cpu-baseline flags: we defined a new baseline in front of the dispatch file.
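For context, a minimal sketch of what such a per-file baseline declaration can look like in a NumPy dispatch-able source. The file name is hypothetical and the target list is only an example, not this PR's actual configuration; the /*@targets ... */ configuration comment is how NumPy's build-time dispatcher enables extra targets for one translation unit on top of the global --cpu-baseline.

    /* pack.dispatch.c.src -- hypothetical file name.
     * The configuration comment below is parsed by NumPy's CPU dispatcher
     * and applies only to this file, on top of the baseline features.
     */
    /*@targets
     ** $maxopt baseline
     ** sse2 avx2 avx512f
     ** vsx2
     ** neon
     **/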
eric-wieser left a comment
See above comments about use of .h files
seiko2plus left a comment
Could you please replace it with npyv_tobits_b8 within #17789? It should perform better on aarch64.
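For illustration, a rough sketch of the operation npyv_tobits_b8 provides (the helper name pack_nonzero_bits and the header path are assumptions; the npyv_* calls are NumPy's universal intrinsics as used in the diff quoted below):

    #include "simd/simd.h"   /* NumPy's universal-intrinsics wrapper (path assumed) */

    /* Hypothetical helper: compare one vector of input bytes against zero and
     * collapse the boolean vector into a bitmask, one bit per byte lane --
     * exactly the primitive np.packbits needs. */
    static NPY_INLINE npy_uint64
    pack_nonzero_bits(const npy_uint8 *in)
    {
        npyv_u8 v    = npyv_load_u8(in);                    /* load npyv_nlanes_u8 bytes  */
        npyv_b8 mask = npyv_cmpneq_u8(v, npyv_zero_u8());   /* lane != 0 -> all-ones lane */
        return npyv_tobits_b8(mask);                        /* bit i = truth of lane i    */
    }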
@seiko2plus The new intrinsics really improved performance a lot; the running time was reduced by a staggering 93%. Here is the latest benchmark result (AVX2 enabled):
Does that increase make sense? Maybe it is somehow skipping the execution. Does the benchmark check that the result is correct?
@mattip AFAIK, the benchmark only checks the performance; correctness should be verified through a mass of test cases. A log was printed in the SIMD loop when running the test cases, so I'm sure the execution process is not skipped. Do you suggest importing a new benchmark library such as Google Benchmark instead of asv in order to get a more convincing result?
Thanks for the clarification. Is a 14x speed increase expected? My intuition, which is frequently wrong, is that if something is too good to be true it means there is a bug.
@mattip MSVC and bureaucracy are two sides of the same coin: they exist to kill performance :).
seiko2plus left a comment
Please, could you add a benchmark test for the little bit order?
numpy/benchmarks/benchmarks/bench_core.py
Line 165 in c9f9081
    bb[2] = npyv_tobits_b8(npyv_cmpneq_u8(v2, v_zero));
    bb[3] = npyv_tobits_b8(npyv_cmpneq_u8(v3, v_zero));
    if(out_stride == 1 &&
        (!NPY_STRONG_ALIGNMENT || npy_is_aligned(outptr, sizeof(npy_uint64)))) {
No need to re-evaluate npy_is_aligned(outptr, sizeof(npy_uint64)) on every iteration; just one call is needed before the loop.
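A hedged sketch of the suggestion, not the PR's exact code (the helper name store_packed_words is made up; outptr, out_stride and bb follow the quoted diff, and NumPy's common headers plus <string.h> are assumed):

    /* With out_stride == 1 the fast path advances outptr by whole npy_uint64
     * words, so alignment established before the loop cannot change inside it:
     * evaluate the predicate once, up front. */
    static NPY_INLINE void
    store_packed_words(char *outptr, npy_intp out_stride,
                       const npy_uint64 *bb, npy_intp nwords)
    {
        const int word_store_ok = (out_stride == 1) &&
            (!NPY_STRONG_ALIGNMENT || npy_is_aligned(outptr, sizeof(npy_uint64)));

        for (npy_intp n = 0; n < nwords; n++) {
            if (word_store_ok) {
                *(npy_uint64 *)outptr = bb[n];          /* contiguous, aligned store */
                outptr += sizeof(npy_uint64);
            }
            else {
                for (size_t j = 0; j < sizeof(npy_uint64); j++) {
                    memcpy(outptr, (const char *)&bb[n] + j, 1);
                    outptr += out_stride;
                }
            }
        }
    }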
    } else {
        for(int i = 0; i < 4; i++) {
            for (int j = 0; j < vstep; j++) {
                memcpy(outptr, (char*)&bb[i] + j, 1);
I know the compiler is going to optimize it, but why do we need memcpy for storing one byte?
We have npyv_storen_till_u32 now; do you suggest adding npyv_storen_till_u8 for a one-byte non-contiguous partial store? IMO, small-size memcpy optimization can be treated as a special optimization in follow-up PRs; this function is called in many places.
Non-contiguous/partial memory access intrinsics for u8, s8, u16, s16 will be useful for other SIMD kernels, but not this one. I just suggested storing each byte via dereferencing the output pointer.
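Concretely, the suggested change would look roughly like this (a sketch mirroring the quoted fragment; it assumes, as in the original loop, that outptr advances by out_stride after each byte):

    } else {
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < vstep; j++) {
                /* write the byte directly instead of memcpy-ing it */
                *outptr = ((const char *)&bb[i])[j];
                outptr += out_stride;
            }
        }
    }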
@mattip I fully understand why you are surprised about the performance improvement, but after multiple log tests I believe that the result is correct. Here is the latest benchmark result (for both little and big order):
Cool. Looks like a clear win.
Thanks @Qiyu8
    #endif
    #if !defined(NPY_STRONG_ALIGNMENT)
    #define NPY_STRONG_ALIGNMENT 0
    #endif
I just noticed this addition. Is this the same idea as NPY_CPU_HAVE_UNALIGNED_ACCESS, which we already use in a handful of places? I am a bit worried about duplicating this logic, especially since this seems to assume that all CPUs have unaligned access, rather than the opposite of assuming that only x86 and amd64 always support unaligned access.
Thanks for pointing this out. In this case only armv7 needs aligned access, which assumes that all other CPUs support unaligned access, so NPY_CPU_HAVE_UNALIGNED_ACCESS and NPY_STRONG_ALIGNMENT are like two sides of a coin. You are right about the duplicated logic; I will integrate them in a follow-up PR.
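To make the overlap concrete, a follow-up could derive one macro from the other, roughly along these lines (a sketch only; whether NPY_CPU_HAVE_UNALIGNED_ACCESS is defined as 0/1 or merely defined/undefined would need to be checked against the existing header):

    /* Sketch: express NPY_STRONG_ALIGNMENT as the inverse of the existing
     * unaligned-access flag instead of maintaining two independent lists
     * of architectures. */
    #ifndef NPY_STRONG_ALIGNMENT
        #if NPY_CPU_HAVE_UNALIGNED_ACCESS
            #define NPY_STRONG_ALIGNMENT 0   /* CPU tolerates unaligned loads/stores */
        #else
            #define NPY_STRONG_ALIGNMENT 1   /* e.g. armv7: require aligned access */
        #endif
    #endif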
np.packbits has already been optimized with intrinsics on SSE & NEON; it can easily be extended to AVX2 by using universal intrinsics. Here are the benchmark results:
With /arch and O2, the performance on SSE/NEON platforms is not significantly changed. Here is the size change of
_multiarray_umath.cp37-win_amd64.pyd