ENH: Umath: Replace raw SIMD of unary floating point (32-64) with NPYV - g0 #16247
mattip merged 4 commits into numpy:master
Conversation
@seiko2plus this needs a redo now that the infrastructure is in place.

@seiko2plus Ping. Looks like the recent header change merges require a rebase. Any possibility of breaking out the fixes required for the 32 bit wheel builds?

@charris, I'm going to rebase it and push the local changes once I'm done with #16782.

@charris,

> Any possibility of breaking out the fixes required for the 32 bit wheel builds?

No, I have to implement a SIMD kernel that handles scalars as well as vectors; just give me a week.
The OpenCV SIMD sqrt is here, but you are right: if we don't check for zero, the result becomes a NaN.
Thanks to #16782, I caught two issues here: positive infinity (same as zero) and precision.
Unfortunately, I had to add a third Newton-Raphson iteration to provide acceptable precision.
Can you add a test for that failure?
I added a check for all sqrt special cases, see:
https://github.com/numpy/numpy/blob/1f872658984b2f8b0fda7022e72ad333a62864f3/numpy/core/tests/test_simd.py#L184-L188
I made another fix to npyv_sqrt_f32() in the last push, which fixes a floating-point division-by-zero error raised by vrsqrteq_f32(x) when x is zero.
Now all the tests pass on armhf.
This will fix the failing 32-bit wheel build?
This change activates the new dispatcher.
The 32-bit wheel build fails due to an aggressive optimization GCC makes that doesn't respect division by zero; this issue also existed on 64-bit when AVX2 and AVX512F aren't enabled.
The partial load intrinsics npyv_load_till_* and npyv_loadn_till_* guarantee filling the tail of the vector with "one", which avoids it.
I also made a slight change in the last push to use the NPYV version for overlapped arrays as well, to guarantee the same precision on armhf.
@r-devulap, I only enabled SSE and dropped AVX2 and AVX512F since there's no performance gain for contiguous arrays; also, the emulated versions of the partial and non-contiguous memory load/store intrinsics show better performance compared with the gather/scatter (AVX512F) intrinsics, especially when I unroll by x2/x4.
I am curious why the shift in order is needed; usually Python.h comes first.
To suppress the warning 'declaration of struct timespec*'; this compiler warning is raised when math.h gets included before Python.h.
Do you have some more info about this you can link to?
Do the new intrinsics need to be added to the _simd module explicitly or is it automatic?
Explicitly, please take a look at:
numpy/core/src/umath/loops_utils.h
We have solve_may_share_memory in common/mem_overlap.c; can we use this opportunity to refactor the code to use it? If not, then this function should be moved to that file.
Apparently solve_may_share_memory() is used to resolve array overlap at the Python level, while what we do here is avoid performing SIMD vector operations on overlapped arrays, since the user expects scalar-by-scalar overlap behavior, which acts differently from unrolling; also, we use scatter/gather operations, which may lead to undefined behaviour.

> If not, then this function should be moved to that file.

I don't think this function is related to the content of common/mem_overlap.c.
numpy/core/tests/test_simd.py
Nice, the power of _simd to show the intrinsics are correct is compelling.
It should be very helpful for the upcoming architectures.
I already provided good benchmarks in the description of this pull-request based on #15987, the last changes I made

IMO, a standalone benchmark script for universal intrinsics is the performance assurance from a technical perspective, but what @mattip cares about is the final performance that NumPy users can perceive, which should be measured by
@seiko2plus this now has a merge conflict
Yes, this is what we now need to verify these changes have not impacted x86_64 in a bad way. I would do it, but I don't seem to be able to stabilize my machine enough to get consistent results.

This patch also improves division precision for NEON/A32
- only covers sqrt, absolute, square and reciprocal
- fix SIMD memory overlap check for aliasing (same ptr & stride)
- unify fp/domain errors for both scalars and vectors
That's actually the exact reason behind #15987: it only compares the inner loops of the ufunc, which reduces the number of outliers.
Try to stabilize your system via `sudo python -m pyperf system tune`. You will need to compare against multiple dispatched targets; you're going to have to use the environment variable, for example:

```shell
# disable AVX512F to benchmark AVX2
export NPY_DISABLE_CPU_FEATURES="AVX512F"
# run asv or #15987

# disable AVX512F and AVX2 to benchmark SSE
export NPY_DISABLE_CPU_FEATURES="AVX2 AVX512F"
# run asv or #15987
```
Doesn't do much on an AMD machine

@hameerabbasi could you benchmark this?

I originally posted this on #15987 by mistake: I ran this PR on a live environment without a desktop (Ubuntu Server), using the method in the PR description. The noise was around 3% and this PR had a performance impact of ±5%, so not too much of a difference.
mattip left a comment:
Thanks @hameerabbasi. So it seems this is good to be merged: performance on x86_64 is unchanged, and this unlocks universal intrinsics for the other architectures (and should solve that pesky test failure on old gcc on 32-bit linux).
Thanks @seiko2plus
This pull-request:
- replaces the raw SIMD of sqrt, absolute, square, reciprocal with the NumPy C SIMD vectorization interface (NPYV)
- fixes the SIMD memory overlap check for aliasing (same ptr & stride)
- unifies fp/domain errors for both scalars and vectors, which leads to fixing the AVX test failures for 32 bit manylinux1 wheels #17174
- improves float32 division precision on NEON/A32
- adds new NPYV intrinsics: sqrt, abs, recip and square
- reorders Python.h to suppress the warning 'declaration of struct timespec*'

merge after #17340
closes #17174
TODO:
Performance tests

Args used within #15987
Note: --msleep 1 forces the running thread to sleep 1 millisecond before collecting each sample, to revert any frequency reduction, since it seems that throttling affects wall time when AVX512F is enabled.

X86
CPU
OS
Benchmark
AVX512F - Contiguous only (metric: gmean, units: ms)
[benchmark table omitted]
AVX512F (metric: gmean, units: ms)
[benchmark table omitted]
AVX2 - Contiguous only (metric: gmean, units: ms)
[benchmark table omitted]
AVX2 (metric: gmean, units: ms)
[benchmark table omitted]
SSE3 - Contiguous only (metric: gmean, units: ms)
[benchmark table omitted]
SSE3 (metric: gmean, units: ms)
[benchmark table omitted]

ARM8 64-bit
CPU
OS
Benchmark
ASIMD - Contiguous only (metric: gmean, units: ms)
[benchmark table omitted]
ASIMD (metric: gmean, units: ms)
[benchmark table omitted]

Power little-endian
CPU
OS
Benchmark
VSX2 (ISA >= 2.07) - Contiguous only (metric: gmean, units: ms)
[benchmark table omitted]
VSX2 (ISA >= 2.07) (metric: gmean, units: ms)
[benchmark table omitted]

Binary size of _multiarray_umath.cpython-ver-arch-linux-gnu.so in kbytes
Note: Debugging symbols are stripped.
EDIT: left some notes