ENH, SIMD: improve argmax/argmin performance #20846
seiko2plus left a comment:
npyv_cleanup() is missing. It's important to call it in case the compiler doesn't generate vzeroupper or vzeroall, to avoid the AVX-SSE transition penalty.
mattip left a comment:
Cool speedups! If I understand correctly, the strategy is to use 512-bit wide vectors (svx) where possible and "emulate" them with smaller vectors where not possible. This provides the most compact code since we do not need 256- or 128-bit-specific loops. Does the compiler "do the right thing" to make the code run at the same speed as if we wrote 128-bit specific loops?
There are a few spots that coverage claims are not hit.
    def time_argmin(self, dtype):
        np.argmin(self.d)
Maybe overkill? Is there any real difference between argmax and argmin? I guess there might be slight differences since they are different loops.
@mattip, sorry, I don't clearly understand your question, but this implementation works with all currently supported SIMD widths (128, 256, 512), and there is no scalar emulation. It only requires the array length to be at least SIMD_WIDTH*4, because the reduce operation on the final vector accumulator isn't cheap and would cause performance regressions on small arrays. Also, the reason for unrolling by 4 vector loads per iteration is that most recent CPUs can execute multiple logical, bitwise, and comparison instructions at the same time.
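The strategy described in this comment can be illustrated with a pure-Python emulation. This is only a sketch under stated assumptions, not the PR's C implementation: the function names, the `lanes` and `unroll` parameters, and the use of a single per-lane accumulator are all hypothetical, and NaN handling is left out (integer input is assumed).

```python
def argmax_scalar(a):
    """Plain loop; returns the index of the first maximum."""
    best = 0
    for i in range(1, len(a)):
        if a[i] > a[best]:
            best = i
    return best


def argmax_blocked(a, lanes=4, unroll=4):
    """Emulate a SIMD-style argmax (hypothetical helper, not NumPy API).

    `lanes` stands in for the number of elements per vector and `unroll`
    for the number of vector loads handled per loop iteration.
    """
    n = len(a)
    step = lanes * unroll
    if n < step:
        # Small input: skip the "vector" path so the final cross-lane
        # reduction never dominates the cost.
        return argmax_scalar(a)

    # One running (value, index) pair per lane.
    acc_val = list(a[:lanes])
    acc_idx = list(range(lanes))

    main_end = n - n % step
    for base in range(0, main_end, step):
        for u in range(unroll):              # four "vector loads" per iteration
            off = base + u * lanes
            for lane in range(lanes):        # one compare + select per vector in real SIMD
                i = off + lane
                if a[i] > acc_val[lane]:     # strict '>' keeps the earliest index per lane
                    acc_val[lane] = a[i]
                    acc_idx[lane] = i

    # Final reduction across lanes: largest value, ties go to the smallest index.
    best = 0
    for lane in range(1, lanes):
        if acc_val[lane] > acc_val[best] or (
            acc_val[lane] == acc_val[best] and acc_idx[lane] < acc_idx[best]
        ):
            best = lane
    best_val, best_idx = acc_val[best], acc_idx[best]

    # Scalar tail for the elements that do not fill a whole unrolled step.
    for i in range(main_end, n):
        if a[i] > best_val:
            best_val, best_idx = a[i], i
    return best_idx


if __name__ == "__main__":
    data = [3, 9, 1, 9, 4, 7, 2, 8, 0, 5, 6, 9, 1, 2, 3, 4, 5, 9]
    print(argmax_blocked(data), data.index(max(data)))
```

The cross-lane reduction at the end is the step that does not pay off for short inputs, which is why the sketch falls back to the scalar loop below `lanes * unroll` elements.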
Thanks, unrolling by 4 was the thing that caught my eye, and your explanation makes it clear.
Looking at the benchmark timings, it seems that across architectures the current code is operation-bound (all the dtypes except bool take approximately the same time), whereas the new code makes the ufunc performance more memory-bound (smaller dtypes are faster), and that there is a cost to processing NaNs.
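As a rough check of this observation, the sketch below normalizes the integer argmin timings from the AVX512_SKX benchmark table in this PR against the int8 result. The helper name and the comparison against element size are illustrative assumptions: a purely memory-bound loop would be expected to scale roughly with the element size (1, 2, 4, 8 bytes), while an operation-bound loop would stay roughly flat.

```python
# Timings in microseconds, copied from the AVX512_SKX argmin rows of this PR's benchmark.
ITEMSIZE = {"int8": 1, "int16": 2, "int32": 4, "int64": 8}
BEFORE_US = {"int8": 179.0, "int16": 120.0, "int32": 180.0, "int64": 182.0}
AFTER_US = {"int8": 6.15, "int16": 9.56, "int32": 18.8, "int64": 59.2}


def relative_to_int8(timings):
    """Express every timing as a multiple of the int8 timing (hypothetical helper)."""
    base = timings["int8"]
    return {name: round(t / base, 2) for name, t in timings.items()}


if __name__ == "__main__":
    before = relative_to_int8(BEFORE_US)
    after = relative_to_int8(AFTER_US)
    for name, size in ITEMSIZE.items():
        print(f"{name:6s} itemsize={size}  before x{before[name]}  after x{after[name]}")
```

With these numbers the "before" column stays near 1 for every dtype, while the "after" column grows with the element size, which matches the reading above.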
Force-pushed from 3c39c06 to ed84326
for two reasons:
numpy/numpy/core/src/multiarray/arraytypes.c.src Lines 3262 to 3274 in ed84326
- 214±0.5μs 28.0±0.1μs 0.13 bench_reduce.ArgMax.time_argmax(<class 'bool'>)
That's true, but I could give priority to numeric values and gain around 40% speed; however, that would have a pretty bad impact on nanargmax/nanargmin performance. My idea is to invert the FP test and avoid breaking out of the loop. By the way, the NaN benchmark isn't fair enough and doesn't expose the real improvements. See: numpy/benchmarks/benchmarks/bench_lib.py Lines 72 to 74 in ed84326
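For context, the NaN-propagating behaviour under discussion can be sketched as a scalar loop that uses a negated floating-point comparison and exits early once a NaN is found. This is a simplified illustration with a hypothetical function name, not the code in arraytypes.c.src and not the loop-without-break variant proposed above.

```python
import math


def argmax_nan_first(a):
    """Index of the first NaN if there is one, otherwise of the first maximum."""
    best_idx, best = 0, a[0]
    if math.isnan(best):
        return 0
    for i in range(1, len(a)):
        x = a[i]
        # Negated test: true when x > best, and also when x is NaN,
        # because every ordered comparison with NaN is false.
        if not (x <= best):
            best_idx, best = i, x
            if math.isnan(x):
                break  # NaN counts as maximal, nothing later can replace it
    return best_idx


if __name__ == "__main__":
    print(argmax_nan_first([1.0, float("nan"), 3.0]))
    print(argmax_nan_first([1.0, 3.0, 2.0]))
```

The early `break` is cheap in scalar code but awkward in a vectorized loop, which is one reason the cost of NaN processing comes up in this thread.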
Force-pushed from ed84326 to 60972df
Force-pushed from 7f32216 to 88c13b9
for all integers, f32 and f64 data types on all supported architectures via universal intrinsics.
to avoid the AVX-SSE transition penalty.
Force-pushed from 88c13b9 to 52787cc
@mattip, code coverage now hits almost all the paths. I think it's ready for merge.
Thanks @seiko2plus
I forgot, do we add release notes for these performance enhancements? Maybe we could add a single note and mention all the ufuncs we have converted to SIMD.
We have a "performance" category, so we do, but we are very spotty about it. I think it is also OK to add multiple notes, one for each; just make sure to use the brief bullet format.
for all integers, f32 and f64 data types on all
supported architectures via universal intrinsics.
related to #20785, #20131
X86
CPU
OS
Linux ip-172-31-32-40 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:03:59 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Python 3.8.10
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Benchmark
AVX512_SKX
  before       after        ratio
  [4b985e2b]   [c99921ad]
  <simd_argmaxmin~1>  <simd_argmaxmin>
- 1.66±0ms     1.54±0ms     0.93  bench_lib.Nan.time_nanargmax(200000, 50.0)
- 926±3μs      850±3μs      0.92  bench_lib.Nan.time_nanargmax(200000, 90.0)
- 1.72±0ms     1.55±0ms     0.90  bench_lib.Nan.time_nanargmin(200000, 50.0)
- 380±3μs      325±0.4μs    0.86  bench_lib.Nan.time_nanargmax(200000, 0.1)
- 993±4μs      849±3μs      0.85  bench_lib.Nan.time_nanargmin(200000, 90.0)
- 378±3μs      323±0.2μs    0.85  bench_lib.Nan.time_nanargmax(200000, 0)
- 482±3μs      406±0.8μs    0.84  bench_lib.Nan.time_nanargmax(200000, 2.0)
- 548±3μs      407±0.8μs    0.74  bench_lib.Nan.time_nanargmin(200000, 2.0)
- 444±3μs      326±0.9μs    0.73  bench_lib.Nan.time_nanargmin(200000, 0.1)
- 441±3μs      322±0.9μs    0.73  bench_lib.Nan.time_nanargmin(200000, 0)
- 6.49±0.08μs  4.45±0.01μs  0.68  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
- 182±0.3μs    59.8±0.4μs   0.33  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
- 182±0.1μs    59.6±0.6μs   0.33  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
- 182±0.07μs   59.2±0.2μs   0.33  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
- 124±0.4μs    26.8±20μs    0.22  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
- 122±0.4μs    18.9±0.2μs   0.16  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
- 180±0.2μs    18.8±0.1μs   0.10  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
- 180±0.2μs    18.6±0.4μs   0.10  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
- 202±4μs      19.1±20μs    0.09  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
- 204±4μs      19.1±20μs    0.09  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
- 120±0.1μs    9.56±0.04μs  0.08  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
- 120±0.2μs    9.54±0.02μs  0.08  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
- 218±3μs      13.7±2μs     0.06  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
- 193±5μs      11.0±4μs     0.06  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
- 212±2μs      11.0±3μs     0.05  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
- 179±0.07μs   6.15±0.02μs  0.03  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
- 179±0.05μs   6.12±0.02μs  0.03  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
- 218±0.7μs    7.07±1μs     0.03  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
- 219±1μs      7.05±1μs     0.03  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
- 190±2μs      6.01±0.02μs  0.03  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
- 191±3μs      6.02±0.01μs  0.03  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)
AVX2
  before       after        ratio
  [4b985e2b]   [c99921ad]
  <simd_argmaxmin~1>  <simd_argmaxmin>
- 964±3μs      895±10μs     0.93  bench_lib.Nan.time_nanargmin(200000, 90.0)
- 1.65±0ms     1.53±0ms     0.93  bench_lib.Nan.time_nanargmin(200000, 50.0)
- 365±1μs      332±0.9μs    0.91  bench_lib.Nan.time_nanargmax(200000, 0)
- 370±0.5μs    336±1μs      0.91  bench_lib.Nan.time_nanargmax(200000, 0.1)
- 6.49±0.06μs  5.80±0.04μs  0.89  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
- 470±2μs      417±1μs      0.89  bench_lib.Nan.time_nanargmax(200000, 2.0)
- 532±2μs      418±0.5μs    0.79  bench_lib.Nan.time_nanargmin(200000, 2.0)
- 432±0.8μs    335±0.6μs    0.78  bench_lib.Nan.time_nanargmin(200000, 0.1)
- 428±2μs      332±1μs      0.77  bench_lib.Nan.time_nanargmin(200000, 0)
- 124±0.1μs    61.4±7μs     0.50  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
- 182±0.09μs   75.1±0.3μs   0.41  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
- 182±0.08μs   71.4±0.3μs   0.39  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
- 182±0.09μs   66.2±0.3μs   0.36  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
- 202±4μs      59.5±6μs     0.29  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
- 122±0.3μs    35.3±0.3μs   0.29  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
- 203±3μs      49.7±8μs     0.25  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
- 180±0.09μs   32.5±0.3μs   0.18  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
- 180±0.1μs    28.8±0.2μs   0.16  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
- 198±4μs      30.0±2μs     0.15  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
- 217±4μs      32.0±2μs     0.15  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
- 120±0.2μs    17.3±0.05μs  0.14  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
- 120±0.1μs    15.2±0.02μs  0.13  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
- 213±5μs      25.2±1μs     0.12  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
- 218±0.7μs    16.4±0.2μs   0.08  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
- 217±2μs      15.3±0.03μs  0.07  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
- 179±0.05μs   9.81±0.04μs  0.05  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
- 193±4μs      9.80±0.04μs  0.05  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
- 179±0.07μs   8.47±0.01μs  0.05  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
- 193±3μs      8.64±0.03μs  0.04  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)
SSE42
  before       after        ratio
  [4b985e2b]   [c99921ad]
  <simd_argmaxmin~1>  <simd_argmaxmin>
- 1.60±0ms     1.52±0ms     0.95  bench_lib.Nan.time_nanargmax(200000, 50.0)
- 470±0.7μs    442±4μs      0.94  bench_lib.Nan.time_nanargmax(200000, 2.0)
- 6.55±0.08μs  6.05±0.02μs  0.92  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
- 1.66±0ms     1.51±0ms     0.91  bench_lib.Nan.time_nanargmin(200000, 50.0)
- 964±2μs      865±2μs      0.90  bench_lib.Nan.time_nanargmin(200000, 90.0)
- 430±0.9μs    359±3μs      0.84  bench_lib.Nan.time_nanargmin(200000, 0)
- 433±1μs      360±3μs      0.83  bench_lib.Nan.time_nanargmin(200000, 0.1)
- 534±1μs      441±3μs      0.83  bench_lib.Nan.time_nanargmin(200000, 2.0)
- 124±0.1μs    95.6±3μs     0.77  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
- 182±0.1μs    102±0.9μs    0.56  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
- 182±0.1μs    101±0.4μs    0.55  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
- 182±0.2μs    81.6±0.2μs   0.45  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
- 122±4μs      54.3±0.4μs   0.44  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
- 213±10μs     89.3±6μs     0.42  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
- 209±8μs      72.9±6μs     0.35  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
- 180±0.2μs    48.3±0.2μs   0.27  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
- 218±3μs      48.4±2μs     0.22  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
- 180±0.06μs   39.3±0.4μs   0.22  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
- 209±10μs     45.2±1μs     0.22  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
- 122±2μs      24.1±0.03μs  0.20  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
- 219±4μs      38.7±1μs     0.18  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
- 120±2μs      19.4±0.01μs  0.16  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
- 219±0.5μs    24.2±0.05μs  0.11  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
- 219±0.9μs    20.5±0.04μs  0.09  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
- 179±0.05μs   13.2±0.01μs  0.07  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
- 207±10μs     13.3±0.03μs  0.06  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
- 179±0.1μs    11.0±0.02μs  0.06  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
- 209±10μs     11.1±0.02μs  0.05  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)
BASELINE(SSE3)
  before       after        ratio
  [4b985e2b]   [c99921ad]
  <simd_argmaxmin~1>  <simd_argmaxmin>
- 1.66±0ms     1.54±0.01ms  0.93  bench_lib.Nan.time_nanargmin(200000, 50.0)
- 198±5μs      183±1μs      0.93  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
- 967±3μs      883±2μs      0.91  bench_lib.Nan.time_nanargmin(200000, 90.0)
- 433±2μs      376±4μs      0.87  bench_lib.Nan.time_nanargmin(200000, 0)
- 438±2μs      377±3μs      0.86  bench_lib.Nan.time_nanargmin(200000, 0.1)
- 124±0.3μs    106±6μs      0.86  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
- 535±3μs      457±4μs      0.85  bench_lib.Nan.time_nanargmin(200000, 2.0)
- 182±0.2μs    153±0.6μs    0.84  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
- 202±3μs      150±1μs      0.75  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
- 182±0.2μs    115±2μs      0.63  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
- 122±0.4μs    64.2±0.9μs   0.53  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
- 180±0.1μs    58.4±0.6μs   0.32  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
- 180±0.2μs    50.4±0.8μs   0.28  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
- 197±7μs      55.0±2μs     0.28  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
- 218±3μs      58.4±3μs     0.27  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
- 121±0.2μs    29.6±0.5μs   0.25  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
- 213±2μs      47.0±1μs     0.22  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
- 120±0.3μs    24.0±0.4μs   0.20  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
- 219±0.4μs    29.3±0.2μs   0.13  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
- 219±0.9μs    23.7±0.2μs   0.11  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
- 179±0.2μs    17.7±0.1μs   0.10  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
- 192±1μs      17.7±0.08μs  0.09  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
- 179±0.1μs    15.4±0.06μs  0.09  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
- 188±2μs      14.8±0.08μs  0.08  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)
Power little-endian
CPU
OS
Linux e517009a912a 4.19.0-2-powerpc64le #1 SMP Debian 4.19.16-1 (2019-01-17) ppc64le ppc64le ppc64le GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Benchmark
baseline(VSX2)
python runtests.py -n --bench-compare parent/main "argmax|argmin" -- --sort ratio
  before       after        ratio
  [4b985e2b]   [c99921ad]
  <simd_argmaxmin~1>  <simd_argmaxmin>
- 1.90±0.01ms  1.78±0ms     0.93  bench_lib.Nan.time_nanargmin(200000, 90.0)
- 294±0.6μs    275±0.9μs    0.93  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
- 1.10±0.01ms  997±7μs      0.90  bench_lib.Nan.time_nanargmin(200000, 2.0)
- 896±6μs      778±2μs      0.87  bench_lib.Nan.time_nanargmin(200000, 0.1)
- 885±10μs     767±2μs      0.87  bench_lib.Nan.time_nanargmin(200000, 0)
- 373±2μs      275±0.3μs    0.74  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
- 219±0.2μs    156±0.5μs    0.71  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
- 219±2μs      154±1μs      0.70  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
- 218±0.08μs   152±0.3μs    0.70  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
- 218±0.2μs    152±0.08μs   0.70  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
- 218±0.3μs    148±2μs      0.68  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
- 219±0.8μs    149±2μs      0.68  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
- 237±0.09μs   156±0.4μs    0.66  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
- 235±1μs      155±1μs      0.66  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
- 329±0.2μs    178±0.3μs    0.54  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
- 327±1μs      176±1μs      0.54  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
- 216±0.07μs   50.3±0.1μs   0.23  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
- 215±1μs      50.0±1μs     0.23  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
- 233±0.5μs    50.6±0.06μs  0.22  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
- 232±0.9μs    48.9±0.9μs   0.21  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
- 216±0.06μs   28.7±0.1μs   0.13  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
- 216±0.6μs    28.5±0.04μs  0.13  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
- 214±0.5μs    28.0±0.1μs   0.13  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
- 233±0.1μs    28.6±0.1μs   0.12  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)
- 233±0.2μs    28.6±0.1μs   0.12  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
AArch64
CPU
OS
Linux ip-172-31-44-172 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:01:34 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Benchmark
baseline(ASIMD)
python runtests.py --bench-compare parent/main "argmax|argmin" -- --sort ratio
  before       after        ratio
  [4b985e2b]   [c99921ad]
  <simd_argmaxmin~1>  <simd_argmaxmin>
- 20.5±0.02μs  18.2±0.01μs  0.89  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
- 169±2μs      88.9±4μs     0.53  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
- 169±2μs      88.9±3μs     0.53  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
- 165±0.1μs    83.1±0.6μs   0.50  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
- 165±0.3μs    82.8±0.5μs   0.50  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
- 246±1μs      81.1±0.9μs   0.33  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
- 244±0.2μs    79.1±0.2μs   0.32  bench_reduce.ArgMin.time_argmin(<class 'numpy.float32'>)
- 166±0.8μs    45.1±2μs     0.27  bench_reduce.ArgMax.time_argmax(<class 'numpy.int32'>)
- 166±0.9μs    44.8±1μs     0.27  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint32'>)
- 164±0.2μs    43.1±0.3μs   0.26  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
- 164±0.06μs   43.0±0.2μs   0.26  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
- 170±0.4μs    24.9±0.5μs   0.15  bench_reduce.ArgMax.time_argmax(<class 'numpy.int16'>)
- 170±0.3μs    24.9±0.5μs   0.15  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint16'>)
- 169±0.1μs    24.6±0.2μs   0.15  bench_reduce.ArgMin.time_argmin(<class 'numpy.int16'>)
- 169±0.2μs    24.5±0.07μs  0.15  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
- 163±0.1μs    13.9±0.04μs  0.08  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)
- 163±0.1μs    13.9±0.09μs  0.08  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
- 163±0.09μs   13.8±0.01μs  0.08  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
- 163±0.1μs    13.8±0.03μs  0.08  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
Binary size (stripped)