ENH, SIMD: improve argmax/argmin performance by seiko2plus · Pull Request #20846 · numpy/numpy

seiko2plus · 2022-01-18T03:32:04Z

for all integers, f32 and f64 data types on all
supported architectures via universal intrinsics.

related to #20785, #20131

X86

CPU

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              2
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:                        4
CPU MHz:                         3410.808
BogoMIPS:                        5999.99
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        2 MiB
L3 cache:                        24.8 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant
                                 _tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt
                                 tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep b
                                 mi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat
                                 pku ospke

OS

Linux ip-172-31-32-40 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:03:59 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Python 3.8.10
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

AVX512_SKX

unset NPY_DISABLE_CPU_FEATURES
python runtests.py -n --bench-compare parent/main "argmax|argmin" -- --sort ratio

       before           after         ratio
     [4b985e2b]       [c99921ad]
     <simd_argmaxmin~1>       <simd_argmaxmin>
-        1.66±0ms         1.54±0ms     0.93  bench_lib.Nan.time_nanargmax(200000, 50.0)
-         926±3μs          850±3μs     0.92  bench_lib.Nan.time_nanargmax(200000, 90.0)
-        1.72±0ms         1.55±0ms     0.90  bench_lib.Nan.time_nanargmin(200000, 50.0)
-         380±3μs        325±0.4μs     0.86  bench_lib.Nan.time_nanargmax(200000, 0.1)
-         993±4μs          849±3μs     0.85  bench_lib.Nan.time_nanargmin(200000, 90.0)
-         378±3μs        323±0.2μs     0.85  bench_lib.Nan.time_nanargmax(200000, 0)
-         482±3μs        406±0.8μs     0.84  bench_lib.Nan.time_nanargmax(200000, 2.0)
-         548±3μs        407±0.8μs     0.74  bench_lib.Nan.time_nanargmin(200000, 2.0)
-         444±3μs        326±0.9μs     0.73  bench_lib.Nan.time_nanargmin(200000, 0.1)
-         441±3μs        322±0.9μs     0.73  bench_lib.Nan.time_nanargmin(200000, 0)
-     6.49±0.08μs      4.45±0.01μs     0.68  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
-       182±0.3μs       59.8±0.4μs     0.33  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
-       182±0.1μs       59.6±0.6μs     0.33  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
-      182±0.07μs       59.2±0.2μs     0.33  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
-       124±0.4μs        26.8±20μs     0.22  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
-       122±0.4μs       18.9±0.2μs     0.16  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
-       180±0.2μs       18.8±0.1μs     0.10  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
-       180±0.2μs       18.6±0.4μs     0.10  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
-         202±4μs        19.1±20μs     0.09  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
-         204±4μs        19.1±20μs     0.09  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
-       120±0.1μs      9.56±0.04μs     0.08  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
-       120±0.2μs      9.54±0.02μs     0.08  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
-         218±3μs         13.7±2μs     0.06  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-         193±5μs         11.0±4μs     0.06  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
-         212±2μs         11.0±3μs     0.05  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
-      179±0.07μs      6.15±0.02μs     0.03  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
-      179±0.05μs      6.12±0.02μs     0.03  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
-       218±0.7μs         7.07±1μs     0.03  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
-         219±1μs         7.05±1μs     0.03  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
-         190±2μs      6.01±0.02μs     0.03  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
-         191±3μs      6.02±0.01μs     0.03  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)

AVX2

export NPY_DISABLE_CPU_FEATURES="AVX512F AVX512_SKX"
python runtests.py -n --bench-compare parent/main "argmax|argmin" -- --sort ratio

       before           after         ratio
     [4b985e2b]       [c99921ad]
     <simd_argmaxmin~1>       <simd_argmaxmin>
-         964±3μs         895±10μs     0.93  bench_lib.Nan.time_nanargmin(200000, 90.0)
-        1.65±0ms         1.53±0ms     0.93  bench_lib.Nan.time_nanargmin(200000, 50.0)
-         365±1μs        332±0.9μs     0.91  bench_lib.Nan.time_nanargmax(200000, 0)
-       370±0.5μs          336±1μs     0.91  bench_lib.Nan.time_nanargmax(200000, 0.1)
-     6.49±0.06μs      5.80±0.04μs     0.89  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
-         470±2μs          417±1μs     0.89  bench_lib.Nan.time_nanargmax(200000, 2.0)
-         532±2μs        418±0.5μs     0.79  bench_lib.Nan.time_nanargmin(200000, 2.0)
-       432±0.8μs        335±0.6μs     0.78  bench_lib.Nan.time_nanargmin(200000, 0.1)
-         428±2μs          332±1μs     0.77  bench_lib.Nan.time_nanargmin(200000, 0)
-       124±0.1μs         61.4±7μs     0.50  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
-      182±0.09μs       75.1±0.3μs     0.41  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
-      182±0.08μs       71.4±0.3μs     0.39  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
-      182±0.09μs       66.2±0.3μs     0.36  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
-         202±4μs         59.5±6μs     0.29  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
-       122±0.3μs       35.3±0.3μs     0.29  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
-         203±3μs         49.7±8μs     0.25  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
-      180±0.09μs       32.5±0.3μs     0.18  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
-       180±0.1μs       28.8±0.2μs     0.16  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
-         198±4μs         30.0±2μs     0.15  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
-         217±4μs         32.0±2μs     0.15  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-       120±0.2μs      17.3±0.05μs     0.14  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
-       120±0.1μs      15.2±0.02μs     0.13  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
-         213±5μs         25.2±1μs     0.12  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
-       218±0.7μs       16.4±0.2μs     0.08  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
-         217±2μs      15.3±0.03μs     0.07  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
-      179±0.05μs      9.81±0.04μs     0.05  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
-         193±4μs      9.80±0.04μs     0.05  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
-      179±0.07μs      8.47±0.01μs     0.05  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
-         193±3μs      8.64±0.03μs     0.04  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)

SSE42

export NPY_DISABLE_CPU_FEATURES="AVX2 AVX512F AVX512_SKX"
python runtests.py -n --bench-compare parent/main "argmax|argmin" -- --sort ratio

       before           after         ratio
     [4b985e2b]       [c99921ad]
     <simd_argmaxmin~1>       <simd_argmaxmin>
-        1.60±0ms         1.52±0ms     0.95  bench_lib.Nan.time_nanargmax(200000, 50.0)
-       470±0.7μs          442±4μs     0.94  bench_lib.Nan.time_nanargmax(200000, 2.0)
-     6.55±0.08μs      6.05±0.02μs     0.92  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
-        1.66±0ms         1.51±0ms     0.91  bench_lib.Nan.time_nanargmin(200000, 50.0)
-         964±2μs          865±2μs     0.90  bench_lib.Nan.time_nanargmin(200000, 90.0)
-       430±0.9μs          359±3μs     0.84  bench_lib.Nan.time_nanargmin(200000, 0)
-         433±1μs          360±3μs     0.83  bench_lib.Nan.time_nanargmin(200000, 0.1)
-         534±1μs          441±3μs     0.83  bench_lib.Nan.time_nanargmin(200000, 2.0)
-       124±0.1μs         95.6±3μs     0.77  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
-       182±0.1μs        102±0.9μs     0.56  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
-       182±0.1μs        101±0.4μs     0.55  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
-       182±0.2μs       81.6±0.2μs     0.45  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
-         122±4μs       54.3±0.4μs     0.44  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
-        213±10μs         89.3±6μs     0.42  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
-         209±8μs         72.9±6μs     0.35  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
-       180±0.2μs       48.3±0.2μs     0.27  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
-         218±3μs         48.4±2μs     0.22  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-      180±0.06μs       39.3±0.4μs     0.22  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
-        209±10μs         45.2±1μs     0.22  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
-         122±2μs      24.1±0.03μs     0.20  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
-         219±4μs         38.7±1μs     0.18  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
-         120±2μs      19.4±0.01μs     0.16  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
-       219±0.5μs      24.2±0.05μs     0.11  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
-       219±0.9μs      20.5±0.04μs     0.09  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
-      179±0.05μs      13.2±0.01μs     0.07  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
-        207±10μs      13.3±0.03μs     0.06  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
-       179±0.1μs      11.0±0.02μs     0.06  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
-        209±10μs      11.1±0.02μs     0.05  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)

BASELINE(SSE3)

export NPY_DISABLE_CPU_FEATURES="SSE42 AVX2 AVX512F AVX512_SKX"
python runtests.py -n --bench-compare parent/main "argmax|argmin" -- --sort ratio

       before           after         ratio
     [4b985e2b]       [c99921ad]
     <simd_argmaxmin~1>       <simd_argmaxmin>
-        1.66±0ms      1.54±0.01ms     0.93  bench_lib.Nan.time_nanargmin(200000, 50.0)
-         198±5μs          183±1μs     0.93  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
-         967±3μs          883±2μs     0.91  bench_lib.Nan.time_nanargmin(200000, 90.0)
-         433±2μs          376±4μs     0.87  bench_lib.Nan.time_nanargmin(200000, 0)
-         438±2μs          377±3μs     0.86  bench_lib.Nan.time_nanargmin(200000, 0.1)
-       124±0.3μs          106±6μs     0.86  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
-         535±3μs          457±4μs     0.85  bench_lib.Nan.time_nanargmin(200000, 2.0)
-       182±0.2μs        153±0.6μs     0.84  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
-         202±3μs          150±1μs     0.75  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
-       182±0.2μs          115±2μs     0.63  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
-       122±0.4μs       64.2±0.9μs     0.53  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
-       180±0.1μs       58.4±0.6μs     0.32  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
-       180±0.2μs       50.4±0.8μs     0.28  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
-         197±7μs         55.0±2μs     0.28  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
-         218±3μs         58.4±3μs     0.27  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-       121±0.2μs       29.6±0.5μs     0.25  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
-         213±2μs         47.0±1μs     0.22  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
-       120±0.3μs       24.0±0.4μs     0.20  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
-       219±0.4μs       29.3±0.2μs     0.13  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
-       219±0.9μs       23.7±0.2μs     0.11  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
-       179±0.2μs       17.7±0.1μs     0.10  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
-         192±1μs      17.7±0.08μs     0.09  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
-       179±0.1μs      15.4±0.06μs     0.09  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
-         188±2μs      14.8±0.08μs     0.08  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)

Power little-endian

CPU

Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable

processor   : 7
cpu     : POWER9 (architected), altivec supported
clock       : 2200.000000MHz
revision    : 2.2 (pvr 004e 1202)

timebase    : 512000000
platform    : pSeries
model       : IBM pSeries (emulated by qemu)
machine     : CHRP IBM pSeries (emulated by qemu)
MMU     : Radix

OS

Linux e517009a912a 4.19.0-2-powerpc64le #1 SMP Debian 4.19.16-1 (2019-01-17) ppc64le ppc64le ppc64le GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

baseline(VSX2)

python runtests.py -n --bench-compare parent/main "argmax|argmin" -- --sort ratio

       before           after         ratio
     [4b985e2b]       [c99921ad]
     <simd_argmaxmin~1>       <simd_argmaxmin>
-     1.90±0.01ms         1.78±0ms     0.93  bench_lib.Nan.time_nanargmin(200000, 90.0)
-       294±0.6μs        275±0.9μs     0.93  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
-     1.10±0.01ms          997±7μs     0.90  bench_lib.Nan.time_nanargmin(200000, 2.0)
-         896±6μs          778±2μs     0.87  bench_lib.Nan.time_nanargmin(200000, 0.1)
-        885±10μs          767±2μs     0.87  bench_lib.Nan.time_nanargmin(200000, 0)
-         373±2μs        275±0.3μs     0.74  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
-       219±0.2μs        156±0.5μs     0.71  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
-         219±2μs          154±1μs     0.70  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
-      218±0.08μs        152±0.3μs     0.70  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
-       218±0.2μs       152±0.08μs     0.70  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
-       218±0.3μs          148±2μs     0.68  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
-       219±0.8μs          149±2μs     0.68  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
-      237±0.09μs        156±0.4μs     0.66  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
-         235±1μs          155±1μs     0.66  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
-       329±0.2μs        178±0.3μs     0.54  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
-         327±1μs          176±1μs     0.54  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-      216±0.07μs       50.3±0.1μs     0.23  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
-         215±1μs         50.0±1μs     0.23  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
-       233±0.5μs      50.6±0.06μs     0.22  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
-       232±0.9μs       48.9±0.9μs     0.21  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
-      216±0.06μs       28.7±0.1μs     0.13  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
-       216±0.6μs      28.5±0.04μs     0.13  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
-       214±0.5μs       28.0±0.1μs     0.13  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
-       233±0.1μs       28.6±0.1μs     0.12  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)
-       233±0.2μs       28.6±0.1μs     0.12  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)

AArch64

CPU

Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          2
On-line CPU(s) list:             0,1
Thread(s) per core:              1
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           1
Model name:                      Neoverse-N1
Stepping:                        r3p1
BogoMIPS:                        243.75
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        2 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0,1
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

OS

Linux ip-172-31-44-172 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:01:34 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

baseline(ASIMD)

python runtests.py --bench-compare parent/main "argmax|argmin" -- --sort ratio

       before           after         ratio
     [4b985e2b]       [c99921ad]
     <simd_argmaxmin~1>       <simd_argmaxmin>
-     20.5±0.02μs      18.2±0.01μs     0.89  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
-         169±2μs         88.9±4μs     0.53  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
-         169±2μs         88.9±3μs     0.53  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
-       165±0.1μs       83.1±0.6μs     0.50  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
-       165±0.3μs       82.8±0.5μs     0.50  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
-         246±1μs       81.1±0.9μs     0.33  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-       244±0.2μs       79.1±0.2μs     0.32  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
-       166±0.8μs         45.1±2μs     0.27  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
-       166±0.9μs         44.8±1μs     0.27  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
-       164±0.2μs       43.1±0.3μs     0.26  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
-      164±0.06μs       43.0±0.2μs     0.26  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
-       170±0.4μs       24.9±0.5μs     0.15  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
-       170±0.3μs       24.9±0.5μs     0.15  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
-       169±0.1μs       24.6±0.2μs     0.15  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
-       169±0.2μs      24.5±0.07μs     0.15  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
-       163±0.1μs      13.9±0.04μs     0.08  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)
-       163±0.1μs      13.9±0.09μs     0.08  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
-      163±0.09μs      13.8±0.01μs     0.08  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
-       163±0.1μs      13.8±0.03μs     0.08  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)

Binary size(striped)

LIB	Before(KB)	After(KB)	Diff(KB)
_multiarray_umath.cpython-38-x86_64-linux-gnu.so	4428	4532	104
_multiarray_umath.cpython-38-powerpc64le-linux-gnu.so	4264	4292	29
_multiarray_umath.cpython-38-aarch64-linux-gnu.so	3424	3444	20

seiko2plus

npyv_cleanup() missing, it's important to be represented in case of the compiler doesn't generate zeroupper or zeroall to avoid the AVX-SSE transition penalty.

numpy/core/src/multiarray/argfunc.dispatch.c.src

mattip

Cool speedups! If I understand correctly, the strategy is to use 512-bit wide vectors (svx) where possible and "emulate" them with smaller vectors where not possible. This provides the most compact code since we do not need 256- or 128-bit-specific loops. Does the compiler "do the right thing" to make the code run at the same speed as if we wrote 128-bit specific loops?

There are a few spots that coverage claims are not hit.

mattip · 2022-01-18T17:15:10Z

benchmarks/benchmarks/bench_reduce.py

+
+    def time_argmin(self, dtype):
+        np.argmin(self.d)
+


~~Maybe overkill? Is there any real difference between argmax and argmin?~~ I guess there might slight differences since they are different loops

numpy/core/src/multiarray/argfunc.dispatch.c.src

benchmarks/benchmarks/bench_reduce.py

seiko2plus · 2022-01-18T23:45:04Z

@mattip, sorry, I don't clearly understand your question but this implementation works perfectly with all current supported of SIMD widths (128, 256, 512) and there's no kind of scalar emulation. it just requires the minimum length of the array to be SIMD_WIDTH*4 because the reduced operation for the final vector accumulator isn't cheap and it will cause performance regressions on small arrays. also, the reason behind unrolling by 4 vector loads for each iteration is based on the fact that most of the new CPUs nowadays can execute multiple instructions logical, bitwise, and comparison operations at the same time.

mattip · 2022-01-19T05:37:23Z

Thanks, unrolling by 4 was the thing that caught my eye, and your explanation makes it clear.

mattip · 2022-01-19T05:47:19Z

Looking at the benchmark timings, it seems to be that across architectures the current code is operation-bound (all the dtypes except bool take approximately the same time), where the new code make the ufunc performance more memory-bound (smaller dtypes are faster), and that there is a cost to processing nans.

seiko2plus · 2022-01-20T06:11:45Z

@mattip,

except bool take approximately the same time

for two reasons:

BOOL_argmin() isn't included by pr because memchr is well optimized by GCC and CLANG.

numpy/numpy/core/src/multiarray/arraytypes.c.src

Lines 3262 to 3274 in ed84326

    
           static int 
        
           BOOL_argmin(npy_bool *ip, npy_intp n, npy_intp *min_ind, 
        
                       PyArrayObject *NPY_UNUSED(aip)) 
        
           { 
        
               npy_bool * p = memchr(ip, 0, n * sizeof(*ip)); 
        
               if (p == NULL) { 
        
                   *min_ind = 0; 
        
                   return 0; 
        
               } 
        
               *min_ind = p - ip; 
        
               return 0; 
        
           }

BOOL_argmax() was already vectorized by raw SIMD for both x86 and aarch64. note on VSX, it shows pretty good improvement:

-       214±0.5μs       28.0±0.1μs     0.13  bench_reduce.ArgMax.time_argmax(<class 'bool'>)

that there is a cost to processing nans.

that's true but I can give the priority for numeric values and gain around 40% speed but that will cause a pretty bad impact on nanargmax/nanargmin performance. my idea is to invert the FP test and avoid breaking the loop. By the way, the nan benchmark isn't fear enough, and it doesn't expose the real improvements. see:

numpy/benchmarks/benchmarks/bench_lib.py

Lines 72 to 74 in ed84326

    
           base_array = np.random.uniform(size=array_size) 
        
           base_array[base_array < percent_nans / 100.] = np.nan 
        
           self.arr = base_array

for all integers, f32 and f64 data types on all supported architectures via universal intrinsics.

to avoid the AVX-SSE transition penalty.

seiko2plus · 2022-02-13T14:28:23Z

@mattip, code coverage almost hit all the paths. I think it's ready for merge.

mattip · 2022-02-13T14:39:51Z

Thanks @seiko2plus

mattip · 2022-02-13T14:40:54Z

I forgot, do we add release notes for these performance enhancements? Maybe we could add a single note and mention all the ufuncs we have converted to SIMD.

seberg · 2022-02-13T14:57:20Z

We have a "performance" category. So we do, but are very very spotty about it. It is also OK to add multiple for each I think, just make sure to use the brief bullet format * Improved performance of ....

github-actions bot added the 01 - Enhancement label Jan 18, 2022

seiko2plus added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Jan 18, 2022

seiko2plus marked this pull request as ready for review January 18, 2022 03:37

seiko2plus mentioned this pull request Jan 18, 2022

BUG: min/max is slow, re-implement using NEON (#17989) #20131

Merged

seiko2plus commented Jan 18, 2022

View reviewed changes

numpy/core/src/multiarray/argfunc.dispatch.c.src Show resolved Hide resolved

numpy/core/src/multiarray/argfunc.dispatch.c.src Show resolved Hide resolved

mattip reviewed Jan 18, 2022

View reviewed changes

benchmarks/benchmarks/bench_reduce.py Show resolved Hide resolved

seiko2plus force-pushed the simd_argmaxmin branch from 3c39c06 to ed84326 Compare January 20, 2022 05:40

seiko2plus force-pushed the simd_argmaxmin branch from ed84326 to 60972df Compare January 20, 2022 06:41

seiko2plus force-pushed the simd_argmaxmin branch 2 times, most recently from 7f32216 to 88c13b9 Compare February 13, 2022 12:27

seiko2plus added 2 commits February 13, 2022 14:32

ENH, SIMD: improve argmax/argmin performance

762538c

for all integers, f32 and f64 data types on all supported architectures via universal intrinsics.

add npyv_cleanup()

52787cc

to avoid the AVX-SSE transition penalty.

seiko2plus force-pushed the simd_argmaxmin branch from 88c13b9 to 52787cc Compare February 13, 2022 12:33

mattip merged commit 7fe5d7f into numpy:main Feb 13, 2022

seiko2plus added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Feb 13, 2022

BvB93 mentioned this pull request Feb 15, 2022

TYP: Add dtype-typing support to fromnumeric part 3 #21060

Merged

ev-br mentioned this pull request Dec 26, 2022

TST: tests/core/test_multiarray:TestArg{Min,Max}: stray np.repeat #22891

Merged

Uh oh!

Conversation

seiko2plus commented Jan 18, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

X86

Benchmark

Power little-endian

Benchmark

AArch64

Benchmark

Binary size(striped)

Uh oh!

seiko2plus left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mattip left a comment

Choose a reason for hiding this comment

Uh oh!

mattip Jan 18, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

seiko2plus commented Jan 18, 2022

Uh oh!

mattip commented Jan 19, 2022

Uh oh!

mattip commented Jan 19, 2022

Uh oh!

seiko2plus commented Jan 20, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seiko2plus commented Feb 13, 2022

Uh oh!

mattip commented Feb 13, 2022

Uh oh!

mattip commented Feb 13, 2022

Uh oh!

seberg commented Feb 13, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

seiko2plus commented Jan 18, 2022 •

edited

Loading

seiko2plus commented Jan 20, 2022 •

edited

Loading