BUG: Fix simd loadable stride logic by seberg · Pull Request #27027 · numpy/numpy

seberg · 2024-07-24T13:46:00Z

It was never correct that strides are divisable by their itemsize and
some code even checked for it correctly, while others had asserts
in place that had no reason why they should be guaranteed to pass.

This moves the check into the loadable_stride macro. The exp
implementation has it's own custom logic, that is probably OK.

I assume that compilers will optimize the duplicate division away,
if that may not be the case, the code should be optimized to avoid
it probably.
(although, I guess these are power of 2 divisions so they don't matter actually)

Closes gh-26775

It was never correct that strides are divisable by their itemsize and *some* code even checked for it correctly, while others had asserts in place that had no reason why they should be guaranteed to pass. This moves the check into the `loadable_stride` macro. The exp implementation has it's own custom logic, that is probably OK. I assume that compilers will optimize the duplicate division away, if that may not be the case, the code should be optimized to avoid it probably.

…sert

seberg · 2024-07-24T15:30:26Z

I have confirmed that this fixes the issue in the i386 docker env. It should be carefully reviewed anyway, since I am not sure how good the tests really are. The tests at gh-27028 crashed (presumably hitting one of the deleted asserts).

EDIT: After removing the asserts, they failed beautifully.

charris · 2024-07-26T20:06:29Z

Does this need a backport? There are a couple of other SIMD fixes...

seberg · 2024-07-26T20:31:14Z

It's a huge bug, but I dunno, since it is also a larger change, and I think around for a while...

mattip · 2024-07-30T12:46:15Z

After removing the asserts, they failed beautifully.

nice

mattip · 2024-07-30T12:47:19Z

I assume that compilers will optimize the duplicate division away,

The code looks good to me. Could you benchmark a relevant ufunc to see if there is any impact?

r-devulap · 2024-07-30T15:54:52Z

the exp implementation has it's own custom logic, that is probably OK.

You mean here?

numpy/numpy/_core/src/umath/loops_exponent_log.dispatch.c.src

Line 1287 in 893215a

if (IS_OUTPUT_BLOCKABLE_UNARY(sizeof(npy_float), sizeof(npy_float), 64)) {

Perhaps, we should use the same logic everywhere for consistency?

r-devulap

Minor nit but otherwise LGTM. Thanks @seberg for fixing this!

r-devulap · 2024-07-30T16:38:32Z

numpy/_core/src/common/simd/simd.h

+        return MAXLOAD > 0 ? llabs(stride) <= MAXLOAD : 1;                  \
+    }                                                                       \
+    NPY_FINLINE int                                                         \
+    npyv_storable_stride_##SFX(npy_intp stride)                             \


Wondering if these can fold into one function. It duplicates the same logic.

I can fold it into a single macro. Although, can MAXLOAD and MAXSTORE be reasonably different? Maybe it should just be NPY_SIMD_MAX_STRIDE?

As of now, only avx512 defines MAXSTORE and it is equal to MAXLOAD. Let's fold into a single value. There was no magic way to calculate them, IIRC it was a simple heuristic.

Actually, AVX2 defines only the load version. I might suspect that it is quite silly to not just use that for the store either way (also stores are much less likely to have huge strides).

But, it makes me want to not include the change here.

seberg · 2024-07-31T11:40:35Z

Not sure there is a test that really might notice it (unless we disable simd accidentally), but I see these timings. That is huge fluctuations, but I can't say any of them look systematic (locally on mac).

The only one that seems a bit plausible might be the 3rd one:

bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'absolute'>, 1, 1, 'Q')

Details

| Change   | Before [50bd355e] <main>   | After [e27f4190] <sane-simd-steps>   |   Ratio | Benchmark (Parameter)                                                                           |
|----------|----------------------------|--------------------------------------|---------|-------------------------------------------------------------------------------------------------|
| +        | 2.09±0.01μs                | 3.55±0.06μs                          |    1.7  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'negative'>, 1, 2, 'f')                    |
| +        | 2.11±0.03μs                | 3.52±0.01μs                          |    1.67 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 1, 2, 'f')                           |
| +        | 12.1±1μs                   | 17.0±0.02μs                          |    1.4  | bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'absolute'>, 1, 1, 'Q')                    |
| +        | 9.59±0.01μs                | 12.7±0.1μs                           |    1.33 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 1, 'd')                            |
| +        | 9.81±0.09μs                | 13.0±0.2μs                           |    1.33 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 1, 2, 'd')                     |
| +        | 9.59±0.01μs                | 12.7±0.02μs                          |    1.32 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 1, 'd')                            |
| +        | 9.70±0.01μs                | 12.8±0.04μs                          |    1.32 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 2, 'd')                            |
| +        | 9.70±0μs                   | 12.8±0μs                             |    1.32 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 2, 'd')                            |
| +        | 9.70±0.01μs                | 12.8±0.01μs                          |    1.32 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 2, 'd')                            |
| +        | 9.71±0.03μs                | 12.8±0.02μs                          |    1.32 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 1, 'd')                            |
| +        | 9.71±0.01μs                | 12.8±0.01μs                          |    1.32 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 2, 'd')                            |
| +        | 9.69±0.06μs                | 12.8±0.1μs                           |    1.32 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 1, 1, 'd')                     |
| +        | 9.73±0.01μs                | 12.8±0.01μs                          |    1.32 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 4, 1, 'd')                     |
| +        | 9.62±0.01μs                | 12.7±0μs                             |    1.32 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 1, 1, 'd')                     |
| +        | 9.80±0.1μs                 | 12.8±0.01μs                          |    1.31 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 1, 2, 'd')                     |
| +        | 9.78±0.03μs                | 12.8±0.01μs                          |    1.31 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 4, 2, 'd')                     |
| +        | 9.88±0.2μs                 | 12.9±0.1μs                           |    1.3  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 1, 'd')                            |
| +        | 9.84±0.09μs                | 12.8±0.1μs                           |    1.3  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 4, 2, 'd')                     |
| +        | 9.88±0.09μs                | 12.8±0.02μs                          |    1.3  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 4, 1, 'd')                     |
| +        | 2.15±0.02μs                | 2.72±0.03μs                          |    1.27 | bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int16'>)                      |
| +        | 3.71±0.03μs                | 4.41±0μs                             |    1.19 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'reciprocal'>, 4, 1, 'd')                  |
| +        | 2.95±0.1μs                 | 3.51±0μs                             |    1.19 | bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'logical_not'>, 1, 1, 'h')                 |
| +        | 3.79±0.3μs                 | 4.47±0.09μs                          |    1.18 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rint'>, 4, 1, 'd')                        |
| +        | 4.90±0.7μs                 | 5.68±0.01μs                          |    1.16 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 2, 4, 1, 'D')           |
| +        | 3.51±0.4μs                 | 4.04±0.02μs                          |    1.15 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 1, 1, 1, 'D')           |
| +        | 3.55±0.5μs                 | 4.09±0.01μs                          |    1.15 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'subtract'>, 1, 1, 1, 'D')      |
| +        | 66.2±0.7μs                 | 76.0±4μs                             |    1.15 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arccos'>, 1, 1, 'd')                             |
| +        | 1.97±0.02μs                | 2.27±0.3μs                           |    1.15 | bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'isfinite'>, 1, 1, 'i')                    |
| +        | 3.69±0.03μs                | 4.21±0.04μs                          |    1.14 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 2, 'd')                      |
| +        | 3.59±0.4μs                 | 4.04±0.02μs                          |    1.13 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'subtract'>, 1, 2, 1, 'D')      |
| +        | 3.69±0.01μs                | 4.18±0.01μs                          |    1.13 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 1, 2, 'd')                           |
| +        | 3.55±0μs                   | 3.98±0μs                             |    1.12 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 2, 'f')                      |
| +        | 3.71±0.01μs                | 4.16±0μs                             |    1.12 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 2, 'd')                      |
| +        | 3.55±0μs                   | 3.98±0.2μs                           |    1.12 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 4, 2, 'f')               |
| +        | 15.3±1μs                   | 17.1±0.1μs                           |    1.12 | bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'sign'>, 1, 1, 'l')                        |
| +        | 3.79±0.03μs                | 4.17±0.04μs                          |    1.1  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 1, 2, 'd')               |
| +        | 110±0.01μs                 | 120±0.1μs                            |    1.09 | bench_ufunc.UFunc.time_ufunc_types('radians')                                                   |
| +        | 4.65±0.4μs                 | 5.07±0.02μs                          |    1.09 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'multiply'>, 4, 4, 1, 'F')      |
| +        | 1.03±0.01μs                | 1.12±0.01μs                          |    1.09 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in1(<ufunc 'multiply'>, 1, 1, 1, 'I')    |
| +        | 67.9±0.8μs                 | 73.9±3μs                             |    1.09 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arccos'>, 4, 2, 'd')                             |
| +        | 1.33±0.02μs                | 1.46±0.03μs                          |    1.09 | bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'isnan'>, 1, 1, 'L')                       |
| +        | 111±1μs                    | 120±0.2μs                            |    1.08 | bench_ufunc.UFunc.time_ufunc_types('deg2rad')                                                   |
| +        | 708±7ns                    | 763±30ns                             |    1.08 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'bitwise_xor'>, 1, 1, 1, 'B') |
| +        | 1.51±0.02μs                | 1.61±0.02μs                          |    1.07 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'add'>, 1, 1, 1, 'Q')         |
| +        | 1.50±0.01μs                | 1.60±0.05μs                          |    1.07 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'bitwise_or'>, 1, 1, 1, 'Q')  |
| +        | 1.50±0.01μs                | 1.60±0.02μs                          |    1.07 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'subtract'>, 1, 1, 1, 'Q')    |
| +        | 851±3ns                    | 910±20ns                             |    1.07 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in1(<ufunc 'add'>, 1, 1, 1, 'h')         |
| +        | 1.50±0.02μs                | 1.60±0.02μs                          |    1.07 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in1(<ufunc 'add'>, 1, 1, 1, 'q')         |
| +        | 3.50±0.04μs                | 3.73±0.04μs                          |    1.07 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 1, 2, 'd')                           |
| +        | 3.50±0.03μs                | 3.76±0.05μs                          |    1.07 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 4, 1, 'f')               |
| +        | 9.03±0.03μs                | 9.61±0.1μs                           |    1.06 | bench_ufunc.UFunc.time_ufunc_types('bitwise_not')                                               |
| +        | 1.02±0μs                   | 1.08±0.01μs                          |    1.06 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'bitwise_xor'>, 1, 1, 1, 'I') |
| +        | 709±2ns                    | 749±10ns                             |    1.06 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'bitwise_xor'>, 1, 1, 1, 'b') |
| +        | 738±10ns                   | 779±20ns                             |    1.06 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in1(<ufunc 'add'>, 1, 1, 1, 'B')         |
| +        | 1.04±0.01μs                | 1.10±0.02μs                          |    1.06 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in1(<ufunc 'bitwise_and'>, 1, 1, 1, 'i') |
| +        | 794±1ns                    | 838±10ns                             |    1.06 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in1(<ufunc 'logical_and'>, 1, 1, 1, 'H') |
| +        | 739±7ns                    | 784±30ns                             |    1.06 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 1, 1, 'f')                              |
| +        | 3.52±0.01μs                | 3.74±0μs                             |    1.06 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 4, 1, 'f')               |
| +        | 16.0±0.9μs                 | 17.0±0.01μs                          |    1.06 | bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'positive'>, 1, 1, 'q')                    |
| +        | 58.5±0.4μs                 | 61.7±0.2μs                           |    1.05 | bench_ufunc.UFunc.time_ufunc_types('isfinite')                                                  |
| +        | 1.55±0.01μs                | 1.63±0.05μs                          |    1.05 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'bitwise_and'>, 1, 1, 1, 'Q') |
| +        | 701±4ns                    | 739±6ns                              |    1.05 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'bitwise_and'>, 1, 1, 1, 'b') |
| +        | 1.06±0.01μs                | 1.12±0.01μs                          |    1.05 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'bitwise_and'>, 1, 1, 1, 'i') |
| +        | 713±4ns                    | 749±10ns                             |    1.05 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in1(<ufunc 'subtract'>, 1, 1, 1, 'b')    |
| +        | 3.55±0.04μs                | 3.74±0.01μs                          |    1.05 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 1, 'f')                      |
| +        | 38.8±0.1μs                 | 40.9±0.6μs                           |    1.05 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log'>, 4, 2, 'd')                                |
| +        | 69.2±0.2μs                 | 72.9±1μs                             |    1.05 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log'>, 1, 1, 'e')                         |
| -        | 390±6ns                    | 371±5ns                              |    0.95 | bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.)))                             |
| -        | 412±20ns                   | 393±1ns                              |    0.95 | bench_ufunc.CustomArrayFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 100)        |
| -        | 325±6ns                    | 308±0.8ns                            |    0.95 | bench_ufunc.Methods0DInvert.time_ndarray__0d__('int16')                                         |
| -        | 458±7ns                    | 433±2ns                              |    0.95 | bench_ufunc.MethodsV1.time_ndarray_meth('__le__', 'complex64')                                  |
| -        | 387±7ns                    | 368±0.8ns                            |    0.95 | bench_ufunc.MethodsV1.time_ndarray_meth('__lt__', 'int64')                                      |
| -        | 687±5ns                    | 653±10ns                             |    0.95 | bench_ufunc.MethodsV1.time_ndarray_meth('__matmul__', 'complex64')                              |
| -        | 70.1±2μs                   | 66.7±0.2μs                           |    0.95 | bench_ufunc.UFunc.time_ufunc_types('copysign')                                                  |
| -        | 139±2μs                    | 132±0.2μs                            |    0.95 | bench_ufunc.UFunc.time_ufunc_types('spacing')                                                   |
| -        | 481±10ns                   | 456±5ns                              |    0.95 | bench_ufunc.UFuncSmall.time_ufunc_small_int_array('sqrt')                                       |
| -        | 6.67±0.1μs                 | 6.32±0.02μs                          |    0.95 | bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 4, 1, 2, 'F')                      |
| -        | 7.66±0.2μs                 | 7.28±0.01μs                          |    0.95 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'multiply'>, 1, 2, 2, 'D')      |
| -        | 3.36±0.01μs                | 3.19±0.2μs                           |    0.95 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'subtract'>, 2, 2, 1, 'F')      |
| -        | 1.17±0.03μs                | 1.12±0.01μs                          |    0.95 | bench_ufunc_strides.BinaryFPSpecial.time_binary(<ufunc 'maximum'>, 1, 1, 1, 'f')                |
| -        | 479±20ns                   | 453±0.5ns                            |    0.95 | bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'bitwise_xor'>, 1, 1, 1, 'b')            |
| -        | 3.73±0.01μs                | 3.54±0.02μs                          |    0.95 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 4, 1, 'f')                 |
| -        | 1.61±0.04ms                | 1.51±0.01ms                          |    0.94 | bench_ufunc.CustomArrayFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 1000000)     |
| -        | 374±4ns                    | 353±3ns                              |    0.94 | bench_ufunc.MethodsV1.time_ndarray_meth('__le__', 'int16')                                      |
| -        | 671±20ns                   | 630±4ns                              |    0.94 | bench_ufunc.MethodsV1.time_ndarray_meth('__matmul__', 'float32')                                |
| -        | 438±10ns                   | 411±0.8ns                            |    0.94 | bench_ufunc.MethodsV1.time_ndarray_meth('__ne__', 'complex64')                                  |
| -        | 65.7±4μs                   | 61.5±0.2μs                           |    0.94 | bench_ufunc.UFunc.time_ufunc_types('trunc')                                                     |
| -        | 452±9ns                    | 424±1ns                              |    0.94 | bench_ufunc.UFuncSmall.time_ufunc_numpy_scalar('cos')                                           |
| -        | 321±5ns                    | 303±1ns                              |    0.94 | bench_ufunc.UFuncSmall.time_ufunc_small_array('abs')                                            |
| -        | 327±9ns                    | 308±0.5ns                            |    0.94 | bench_ufunc.UFuncSmall.time_ufunc_small_array('cos')                                            |
| -        | 484±3ns                    | 454±3ns                              |    0.94 | bench_ufunc.UFuncSmall.time_ufunc_small_int_array('cos')                                        |
| -        | 8.89±0.05μs                | 8.38±0μs                             |    0.94 | bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 4, 2, 4, 'F')                      |
| -        | 20.7±0.7μs                 | 19.5±0.04μs                          |    0.94 | bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 4, 4, 4, 'D')                      |
| -        | 14.3±0.05μs                | 13.5±0.03μs                          |    0.94 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 4, 4, 4, 'D')           |
| -        | 7.90±0.06μs                | 7.46±0.01μs                          |    0.94 | bench_ufunc_strides.BinaryFP.time_binary_scalar_in0(<ufunc 'fmin'>, 2, 4, 4, 'd')               |
| -        | 3.74±0.02μs                | 3.53±0.01μs                          |    0.94 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 4, 1, 'f')                        |
| -        | 1.61±0.1μs                 | 1.51±0.02μs                          |    0.94 | bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'conjugate'>, 1, 1, 'b')                   |
| -        | 1.41±0.04μs                | 1.31±0.02μs                          |    0.93 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint16'>, 43)      |
| -        | 320±5ns                    | 298±0.9ns                            |    0.93 | bench_ufunc.Methods0DBoolComplex.time_ndarray__0d__('__complex__', 'float32')                   |
| -        | 211±4μs                    | 197±1μs                              |    0.93 | bench_ufunc.UFunc.time_ufunc_types('divide')                                                    |
| -        | 15.5±0.07μs                | 14.4±0.8μs                           |    0.93 | bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'divide'>, 4, 4, 1, 'D')                   |
| -        | 551±10ns                   | 514±6ns                              |    0.93 | bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'logical_and'>, 1, 1, 1, 'B')            |
| -        | 494±2ns                    | 457±2ns                              |    0.93 | bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'subtract'>, 1, 1, 1, 'b')               |
| -        | 4.43±0.03μs                | 4.10±0.3μs                           |    0.93 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 4, 1, 'd')                              |
| -        | 1.42±0.04μs                | 1.30±0.01μs                          |    0.92 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint16'>, 8)       |
| -        | 59.9±1μs                   | 54.9±2μs                             |    0.92 | bench_ufunc.UFunc.time_ufunc_types('conj')                                                      |
| -        | 105±1μs                    | 96.7±0.5μs                           |    0.92 | bench_ufunc.UFunc.time_ufunc_types('minimum')                                                   |
| -        | 5.39±0.01μs                | 4.97±0.3μs                           |    0.92 | bench_ufunc_strides.BinaryFP.time_binary(<ufunc 'fmin'>, 1, 4, 1, 'd')                          |
| -        | 3.85±0μs                   | 3.52±0μs                             |    0.92 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 4, 2, 'f')                        |
| -        | 3.84±0.01μs                | 3.53±0.01μs                          |    0.92 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 4, 2, 'f')                 |
| -        | 8.83±0.02μs                | 8.08±0.7μs                           |    0.91 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 4, 4, 1, 'D')      |
| -        | 1.37±0.07μs                | 1.25±0.01μs                          |    0.91 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 1, 1, 'd')                              |
| -        | 850±40ns                   | 777±30ns                             |    0.91 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 1, 1, 'f')                              |
| -        | 4.52±0.09μs                | 4.13±0.3μs                           |    0.91 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'ceil'>, 4, 1, 'd')                        |
| -        | 7.16±0.06μs                | 6.45±0.6μs                           |    0.9  | bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 2, 1, 1, 'D')                 |
| -        | 503±8ns                    | 454±3ns                              |    0.9  | bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'multiply'>, 1, 1, 1, 'B')               |
| -        | 848±60ns                   | 743±9ns                              |    0.88 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'absolute'>, 1, 1, 'f')                    |
| -        | 861±30ns                   | 753±6ns                              |    0.87 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'ceil'>, 1, 1, 'f')                               |
| -        | 5.81±0.02μs                | 5.03±0.6μs                           |    0.86 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'multiply'>, 1, 2, 1, 'D')      |
| -        | 4.55±0.07μs                | 3.81±0.03μs                          |    0.84 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 4, 2, 'd')                        |
| -        | 4.54±0.1μs                 | 3.82±0.04μs                          |    0.84 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 4, 2, 'd')                 |
| -        | 4.51±0.03μs                | 3.69±0.04μs                          |    0.82 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 4, 1, 'd')                        |
| -        | 4.51±0.01μs                | 3.69±0.03μs                          |    0.82 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 4, 1, 'd')                 |
| -        | 3.22±0.01μs                | 2.12±0.02μs                          |    0.66 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'negative'>, 4, 2, 'f')                    |
| -        | 3.26±0.03μs                | 2.11±0.01μs                          |    0.65 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 4, 2, 'f')                           |

charris · 2024-08-01T17:46:57Z

@seberg Are you OK with merging this as is?

seberg · 2024-08-01T18:03:08Z

Yes, of course, was just giving reviewers a chance to comment in case.

charris · 2024-08-01T19:12:46Z

Thanks Sebastian.

seberg · 2024-08-02T10:13:52Z

Hmmmpf, the test didn't fail here but does not on other CI runs. I guess machine capability fluctuations...

@r-devulap do you have time to have a look at fixing ldexp and frexp? Not sure what is best, aligning to use npyv_loadable_stride_##sfx/npyv_storable_stride_##sfx, or just adding the alignment check that some of the other macros already have here.
The choice of the max stride is also slightly different, although I think compatible.

seberg added 2 commits July 24, 2024 15:43

TST: Add test to partially cover SIMD alignment issues

9fc6a96

github-actions bot added the 00 - Bug label Jul 24, 2024

seberg mentioned this pull request Jul 24, 2024

DO NOT MERGE: Add test to partially cover SIMD alignment issues #27028

Closed

MAINT: Move len placement to avoid warning

a0c9247

seberg requested review from Mousius and r-devulap July 24, 2024 13:52

seberg added 4 commits July 24, 2024 15:55

MAINT: Use stdalign.h alignof to be compatible with C++/C

935ef6f

TST: Fixup test (need to use alignment in stride not offset)

e8f9f54

BUG: Fix definition, one more incorrect check and remove incorrect as…

b43e62a

…sert

TST: Test fixup (seems I ran the wrong one locally...)

e27f419

r-devulap approved these changes Jul 30, 2024

View reviewed changes

charris merged commit 1082661 into numpy:main Aug 1, 2024

seberg deleted the sane-simd-steps branch August 1, 2024 19:52

seberg mentioned this pull request Aug 2, 2024

CI,BUG: windows 32 bit CI runs flaky ontest_ufunc_noncontiguous ldexp/frexp #27093

Closed

r-devulap mentioned this pull request Aug 5, 2024

BUG: Use the new npyv_loadable_stride_ functions for ldexp and frexp #27115

Merged

charris mentioned this pull request Aug 6, 2024

BUG: Use the new npyv_loadable_stride_ functions for ldexp and frexp #27122

Merged

Uh oh!

Conversation

seberg commented Jul 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seberg commented Jul 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

charris commented Jul 26, 2024

Uh oh!

seberg commented Jul 26, 2024

Uh oh!

mattip commented Jul 30, 2024

Uh oh!

mattip commented Jul 30, 2024

Uh oh!

r-devulap commented Jul 30, 2024

Uh oh!

r-devulap left a comment

Choose a reason for hiding this comment

Uh oh!

r-devulap Jul 30, 2024

Choose a reason for hiding this comment

Uh oh!

seberg Jul 30, 2024

Choose a reason for hiding this comment

Uh oh!

r-devulap Jul 30, 2024

Choose a reason for hiding this comment

Uh oh!

seberg Jul 31, 2024

Choose a reason for hiding this comment

Uh oh!

seberg commented Jul 31, 2024

Uh oh!

charris commented Aug 1, 2024

Uh oh!

seberg commented Aug 1, 2024

Uh oh!

charris commented Aug 1, 2024

Uh oh!

seberg commented Aug 2, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

seberg commented Jul 24, 2024 •

edited

Loading

seberg commented Jul 24, 2024 •

edited

Loading