Skip to content

BUG: Fix simd loadable stride logic #27027

Merged
charris merged 7 commits intonumpy:mainfrom
seberg:sane-simd-steps
Aug 1, 2024
Merged

BUG: Fix simd loadable stride logic #27027
charris merged 7 commits intonumpy:mainfrom
seberg:sane-simd-steps

Conversation

@seberg
Copy link
Member

@seberg seberg commented Jul 24, 2024

It was never correct that strides are divisable by their itemsize and
some code even checked for it correctly, while others had asserts
in place that had no reason why they should be guaranteed to pass.

This moves the check into the loadable_stride macro. The exp
implementation has it's own custom logic, that is probably OK.

I assume that compilers will optimize the duplicate division away,
if that may not be the case, the code should be optimized to avoid
it probably.
(although, I guess these are power of 2 divisions so they don't matter actually)

Closes gh-26775

seberg added 2 commits July 24, 2024 15:43
It was never correct that strides are divisable by their itemsize and
*some* code even checked for it correctly, while others had asserts
in place that had no reason why they should be guaranteed to pass.

This moves the check into the `loadable_stride` macro.  The exp
implementation has it's own custom logic, that is probably OK.

I assume that compilers will optimize the duplicate division away,
if that may not be the case, the code should be optimized to avoid
it probably.
@seberg seberg requested review from Mousius and r-devulap July 24, 2024 13:52
@seberg
Copy link
Member Author

seberg commented Jul 24, 2024

I have confirmed that this fixes the issue in the i386 docker env. It should be carefully reviewed anyway, since I am not sure how good the tests really are. The tests at gh-27028 crashed (presumably hitting one of the deleted asserts).

EDIT: After removing the asserts, they failed beautifully.

@charris
Copy link
Member

charris commented Jul 26, 2024

Does this need a backport? There are a couple of other SIMD fixes...

@seberg
Copy link
Member Author

seberg commented Jul 26, 2024

It's a huge bug, but I dunno, since it is also a larger change, and I think around for a while...

@mattip
Copy link
Member

mattip commented Jul 30, 2024

After removing the asserts, they failed beautifully.

nice

@mattip
Copy link
Member

mattip commented Jul 30, 2024

I assume that compilers will optimize the duplicate division away,

The code looks good to me. Could you benchmark a relevant ufunc to see if there is any impact?

@r-devulap
Copy link
Member

the exp implementation has it's own custom logic, that is probably OK.

You mean here?

if (IS_OUTPUT_BLOCKABLE_UNARY(sizeof(npy_float), sizeof(npy_float), 64)) {
Perhaps, we should use the same logic everywhere for consistency?

Copy link
Member

@r-devulap r-devulap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit but otherwise LGTM. Thanks @seberg for fixing this!

return MAXLOAD > 0 ? llabs(stride) <= MAXLOAD : 1; \
} \
NPY_FINLINE int \
npyv_storable_stride_##SFX(npy_intp stride) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if these can fold into one function. It duplicates the same logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can fold it into a single macro. Although, can MAXLOAD and MAXSTORE be reasonably different? Maybe it should just be NPY_SIMD_MAX_STRIDE?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of now, only avx512 defines MAXSTORE and it is equal to MAXLOAD. Let's fold into a single value. There was no magic way to calculate them, IIRC it was a simple heuristic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, AVX2 defines only the load version. I might suspect that it is quite silly to not just use that for the store either way (also stores are much less likely to have huge strides).

But, it makes me want to not include the change here.

@seberg
Copy link
Member Author

seberg commented Jul 31, 2024

Not sure there is a test that really might notice it (unless we disable simd accidentally), but I see these timings. That is huge fluctuations, but I can't say any of them look systematic (locally on mac).

The only one that seems a bit plausible might be the 3rd one:

bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'absolute'>, 1, 1, 'Q')
Details
| Change   | Before [50bd355e] <main>   | After [e27f4190] <sane-simd-steps>   |   Ratio | Benchmark (Parameter)                                                                           |
|----------|----------------------------|--------------------------------------|---------|-------------------------------------------------------------------------------------------------|
| +        | 2.09±0.01μs                | 3.55±0.06μs                          |    1.7  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'negative'>, 1, 2, 'f')                    |
| +        | 2.11±0.03μs                | 3.52±0.01μs                          |    1.67 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 1, 2, 'f')                           |
| +        | 12.1±1μs                   | 17.0±0.02μs                          |    1.4  | bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'absolute'>, 1, 1, 'Q')                    |
| +        | 9.59±0.01μs                | 12.7±0.1μs                           |    1.33 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 1, 'd')                            |
| +        | 9.81±0.09μs                | 13.0±0.2μs                           |    1.33 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 1, 2, 'd')                     |
| +        | 9.59±0.01μs                | 12.7±0.02μs                          |    1.32 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 1, 'd')                            |
| +        | 9.70±0.01μs                | 12.8±0.04μs                          |    1.32 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 2, 'd')                            |
| +        | 9.70±0μs                   | 12.8±0μs                             |    1.32 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 2, 'd')                            |
| +        | 9.70±0.01μs                | 12.8±0.01μs                          |    1.32 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 2, 'd')                            |
| +        | 9.71±0.03μs                | 12.8±0.02μs                          |    1.32 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 1, 'd')                            |
| +        | 9.71±0.01μs                | 12.8±0.01μs                          |    1.32 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 2, 'd')                            |
| +        | 9.69±0.06μs                | 12.8±0.1μs                           |    1.32 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 1, 1, 'd')                     |
| +        | 9.73±0.01μs                | 12.8±0.01μs                          |    1.32 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 4, 1, 'd')                     |
| +        | 9.62±0.01μs                | 12.7±0μs                             |    1.32 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 1, 1, 'd')                     |
| +        | 9.80±0.1μs                 | 12.8±0.01μs                          |    1.31 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 1, 2, 'd')                     |
| +        | 9.78±0.03μs                | 12.8±0.01μs                          |    1.31 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 4, 2, 'd')                     |
| +        | 9.88±0.2μs                 | 12.9±0.1μs                           |    1.3  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 1, 'd')                            |
| +        | 9.84±0.09μs                | 12.8±0.1μs                           |    1.3  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 4, 2, 'd')                     |
| +        | 9.88±0.09μs                | 12.8±0.02μs                          |    1.3  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 4, 1, 'd')                     |
| +        | 2.15±0.02μs                | 2.72±0.03μs                          |    1.27 | bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int16'>)                      |
| +        | 3.71±0.03μs                | 4.41±0μs                             |    1.19 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'reciprocal'>, 4, 1, 'd')                  |
| +        | 2.95±0.1μs                 | 3.51±0μs                             |    1.19 | bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'logical_not'>, 1, 1, 'h')                 |
| +        | 3.79±0.3μs                 | 4.47±0.09μs                          |    1.18 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rint'>, 4, 1, 'd')                        |
| +        | 4.90±0.7μs                 | 5.68±0.01μs                          |    1.16 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 2, 4, 1, 'D')           |
| +        | 3.51±0.4μs                 | 4.04±0.02μs                          |    1.15 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 1, 1, 1, 'D')           |
| +        | 3.55±0.5μs                 | 4.09±0.01μs                          |    1.15 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'subtract'>, 1, 1, 1, 'D')      |
| +        | 66.2±0.7μs                 | 76.0±4μs                             |    1.15 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arccos'>, 1, 1, 'd')                             |
| +        | 1.97±0.02μs                | 2.27±0.3μs                           |    1.15 | bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'isfinite'>, 1, 1, 'i')                    |
| +        | 3.69±0.03μs                | 4.21±0.04μs                          |    1.14 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 2, 'd')                      |
| +        | 3.59±0.4μs                 | 4.04±0.02μs                          |    1.13 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'subtract'>, 1, 2, 1, 'D')      |
| +        | 3.69±0.01μs                | 4.18±0.01μs                          |    1.13 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 1, 2, 'd')                           |
| +        | 3.55±0μs                   | 3.98±0μs                             |    1.12 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 2, 'f')                      |
| +        | 3.71±0.01μs                | 4.16±0μs                             |    1.12 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 2, 'd')                      |
| +        | 3.55±0μs                   | 3.98±0.2μs                           |    1.12 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 4, 2, 'f')               |
| +        | 15.3±1μs                   | 17.1±0.1μs                           |    1.12 | bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'sign'>, 1, 1, 'l')                        |
| +        | 3.79±0.03μs                | 4.17±0.04μs                          |    1.1  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 1, 2, 'd')               |
| +        | 110±0.01μs                 | 120±0.1μs                            |    1.09 | bench_ufunc.UFunc.time_ufunc_types('radians')                                                   |
| +        | 4.65±0.4μs                 | 5.07±0.02μs                          |    1.09 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'multiply'>, 4, 4, 1, 'F')      |
| +        | 1.03±0.01μs                | 1.12±0.01μs                          |    1.09 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in1(<ufunc 'multiply'>, 1, 1, 1, 'I')    |
| +        | 67.9±0.8μs                 | 73.9±3μs                             |    1.09 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arccos'>, 4, 2, 'd')                             |
| +        | 1.33±0.02μs                | 1.46±0.03μs                          |    1.09 | bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'isnan'>, 1, 1, 'L')                       |
| +        | 111±1μs                    | 120±0.2μs                            |    1.08 | bench_ufunc.UFunc.time_ufunc_types('deg2rad')                                                   |
| +        | 708±7ns                    | 763±30ns                             |    1.08 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'bitwise_xor'>, 1, 1, 1, 'B') |
| +        | 1.51±0.02μs                | 1.61±0.02μs                          |    1.07 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'add'>, 1, 1, 1, 'Q')         |
| +        | 1.50±0.01μs                | 1.60±0.05μs                          |    1.07 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'bitwise_or'>, 1, 1, 1, 'Q')  |
| +        | 1.50±0.01μs                | 1.60±0.02μs                          |    1.07 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'subtract'>, 1, 1, 1, 'Q')    |
| +        | 851±3ns                    | 910±20ns                             |    1.07 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in1(<ufunc 'add'>, 1, 1, 1, 'h')         |
| +        | 1.50±0.02μs                | 1.60±0.02μs                          |    1.07 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in1(<ufunc 'add'>, 1, 1, 1, 'q')         |
| +        | 3.50±0.04μs                | 3.73±0.04μs                          |    1.07 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 1, 2, 'd')                           |
| +        | 3.50±0.03μs                | 3.76±0.05μs                          |    1.07 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 4, 1, 'f')               |
| +        | 9.03±0.03μs                | 9.61±0.1μs                           |    1.06 | bench_ufunc.UFunc.time_ufunc_types('bitwise_not')                                               |
| +        | 1.02±0μs                   | 1.08±0.01μs                          |    1.06 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'bitwise_xor'>, 1, 1, 1, 'I') |
| +        | 709±2ns                    | 749±10ns                             |    1.06 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'bitwise_xor'>, 1, 1, 1, 'b') |
| +        | 738±10ns                   | 779±20ns                             |    1.06 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in1(<ufunc 'add'>, 1, 1, 1, 'B')         |
| +        | 1.04±0.01μs                | 1.10±0.02μs                          |    1.06 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in1(<ufunc 'bitwise_and'>, 1, 1, 1, 'i') |
| +        | 794±1ns                    | 838±10ns                             |    1.06 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in1(<ufunc 'logical_and'>, 1, 1, 1, 'H') |
| +        | 739±7ns                    | 784±30ns                             |    1.06 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 1, 1, 'f')                              |
| +        | 3.52±0.01μs                | 3.74±0μs                             |    1.06 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 4, 1, 'f')               |
| +        | 16.0±0.9μs                 | 17.0±0.01μs                          |    1.06 | bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'positive'>, 1, 1, 'q')                    |
| +        | 58.5±0.4μs                 | 61.7±0.2μs                           |    1.05 | bench_ufunc.UFunc.time_ufunc_types('isfinite')                                                  |
| +        | 1.55±0.01μs                | 1.63±0.05μs                          |    1.05 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'bitwise_and'>, 1, 1, 1, 'Q') |
| +        | 701±4ns                    | 739±6ns                              |    1.05 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'bitwise_and'>, 1, 1, 1, 'b') |
| +        | 1.06±0.01μs                | 1.12±0.01μs                          |    1.05 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'bitwise_and'>, 1, 1, 1, 'i') |
| +        | 713±4ns                    | 749±10ns                             |    1.05 | bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in1(<ufunc 'subtract'>, 1, 1, 1, 'b')    |
| +        | 3.55±0.04μs                | 3.74±0.01μs                          |    1.05 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 1, 'f')                      |
| +        | 38.8±0.1μs                 | 40.9±0.6μs                           |    1.05 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log'>, 4, 2, 'd')                                |
| +        | 69.2±0.2μs                 | 72.9±1μs                             |    1.05 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log'>, 1, 1, 'e')                         |
| -        | 390±6ns                    | 371±5ns                              |    0.95 | bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.)))                             |
| -        | 412±20ns                   | 393±1ns                              |    0.95 | bench_ufunc.CustomArrayFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 100)        |
| -        | 325±6ns                    | 308±0.8ns                            |    0.95 | bench_ufunc.Methods0DInvert.time_ndarray__0d__('int16')                                         |
| -        | 458±7ns                    | 433±2ns                              |    0.95 | bench_ufunc.MethodsV1.time_ndarray_meth('__le__', 'complex64')                                  |
| -        | 387±7ns                    | 368±0.8ns                            |    0.95 | bench_ufunc.MethodsV1.time_ndarray_meth('__lt__', 'int64')                                      |
| -        | 687±5ns                    | 653±10ns                             |    0.95 | bench_ufunc.MethodsV1.time_ndarray_meth('__matmul__', 'complex64')                              |
| -        | 70.1±2μs                   | 66.7±0.2μs                           |    0.95 | bench_ufunc.UFunc.time_ufunc_types('copysign')                                                  |
| -        | 139±2μs                    | 132±0.2μs                            |    0.95 | bench_ufunc.UFunc.time_ufunc_types('spacing')                                                   |
| -        | 481±10ns                   | 456±5ns                              |    0.95 | bench_ufunc.UFuncSmall.time_ufunc_small_int_array('sqrt')                                       |
| -        | 6.67±0.1μs                 | 6.32±0.02μs                          |    0.95 | bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 4, 1, 2, 'F')                      |
| -        | 7.66±0.2μs                 | 7.28±0.01μs                          |    0.95 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'multiply'>, 1, 2, 2, 'D')      |
| -        | 3.36±0.01μs                | 3.19±0.2μs                           |    0.95 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'subtract'>, 2, 2, 1, 'F')      |
| -        | 1.17±0.03μs                | 1.12±0.01μs                          |    0.95 | bench_ufunc_strides.BinaryFPSpecial.time_binary(<ufunc 'maximum'>, 1, 1, 1, 'f')                |
| -        | 479±20ns                   | 453±0.5ns                            |    0.95 | bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'bitwise_xor'>, 1, 1, 1, 'b')            |
| -        | 3.73±0.01μs                | 3.54±0.02μs                          |    0.95 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 4, 1, 'f')                 |
| -        | 1.61±0.04ms                | 1.51±0.01ms                          |    0.94 | bench_ufunc.CustomArrayFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 1000000)     |
| -        | 374±4ns                    | 353±3ns                              |    0.94 | bench_ufunc.MethodsV1.time_ndarray_meth('__le__', 'int16')                                      |
| -        | 671±20ns                   | 630±4ns                              |    0.94 | bench_ufunc.MethodsV1.time_ndarray_meth('__matmul__', 'float32')                                |
| -        | 438±10ns                   | 411±0.8ns                            |    0.94 | bench_ufunc.MethodsV1.time_ndarray_meth('__ne__', 'complex64')                                  |
| -        | 65.7±4μs                   | 61.5±0.2μs                           |    0.94 | bench_ufunc.UFunc.time_ufunc_types('trunc')                                                     |
| -        | 452±9ns                    | 424±1ns                              |    0.94 | bench_ufunc.UFuncSmall.time_ufunc_numpy_scalar('cos')                                           |
| -        | 321±5ns                    | 303±1ns                              |    0.94 | bench_ufunc.UFuncSmall.time_ufunc_small_array('abs')                                            |
| -        | 327±9ns                    | 308±0.5ns                            |    0.94 | bench_ufunc.UFuncSmall.time_ufunc_small_array('cos')                                            |
| -        | 484±3ns                    | 454±3ns                              |    0.94 | bench_ufunc.UFuncSmall.time_ufunc_small_int_array('cos')                                        |
| -        | 8.89±0.05μs                | 8.38±0μs                             |    0.94 | bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 4, 2, 4, 'F')                      |
| -        | 20.7±0.7μs                 | 19.5±0.04μs                          |    0.94 | bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 4, 4, 4, 'D')                      |
| -        | 14.3±0.05μs                | 13.5±0.03μs                          |    0.94 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 4, 4, 4, 'D')           |
| -        | 7.90±0.06μs                | 7.46±0.01μs                          |    0.94 | bench_ufunc_strides.BinaryFP.time_binary_scalar_in0(<ufunc 'fmin'>, 2, 4, 4, 'd')               |
| -        | 3.74±0.02μs                | 3.53±0.01μs                          |    0.94 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 4, 1, 'f')                        |
| -        | 1.61±0.1μs                 | 1.51±0.02μs                          |    0.94 | bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'conjugate'>, 1, 1, 'b')                   |
| -        | 1.41±0.04μs                | 1.31±0.02μs                          |    0.93 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint16'>, 43)      |
| -        | 320±5ns                    | 298±0.9ns                            |    0.93 | bench_ufunc.Methods0DBoolComplex.time_ndarray__0d__('__complex__', 'float32')                   |
| -        | 211±4μs                    | 197±1μs                              |    0.93 | bench_ufunc.UFunc.time_ufunc_types('divide')                                                    |
| -        | 15.5±0.07μs                | 14.4±0.8μs                           |    0.93 | bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'divide'>, 4, 4, 1, 'D')                   |
| -        | 551±10ns                   | 514±6ns                              |    0.93 | bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'logical_and'>, 1, 1, 1, 'B')            |
| -        | 494±2ns                    | 457±2ns                              |    0.93 | bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'subtract'>, 1, 1, 1, 'b')               |
| -        | 4.43±0.03μs                | 4.10±0.3μs                           |    0.93 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 4, 1, 'd')                              |
| -        | 1.42±0.04μs                | 1.30±0.01μs                          |    0.92 | bench_ufunc.CustomScalarFloorDivideUInt.time_floor_divide_uint(<class 'numpy.uint16'>, 8)       |
| -        | 59.9±1μs                   | 54.9±2μs                             |    0.92 | bench_ufunc.UFunc.time_ufunc_types('conj')                                                      |
| -        | 105±1μs                    | 96.7±0.5μs                           |    0.92 | bench_ufunc.UFunc.time_ufunc_types('minimum')                                                   |
| -        | 5.39±0.01μs                | 4.97±0.3μs                           |    0.92 | bench_ufunc_strides.BinaryFP.time_binary(<ufunc 'fmin'>, 1, 4, 1, 'd')                          |
| -        | 3.85±0μs                   | 3.52±0μs                             |    0.92 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 4, 2, 'f')                        |
| -        | 3.84±0.01μs                | 3.53±0.01μs                          |    0.92 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 4, 2, 'f')                 |
| -        | 8.83±0.02μs                | 8.08±0.7μs                           |    0.91 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 4, 4, 1, 'D')      |
| -        | 1.37±0.07μs                | 1.25±0.01μs                          |    0.91 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 1, 1, 'd')                              |
| -        | 850±40ns                   | 777±30ns                             |    0.91 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 1, 1, 'f')                              |
| -        | 4.52±0.09μs                | 4.13±0.3μs                           |    0.91 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'ceil'>, 4, 1, 'd')                        |
| -        | 7.16±0.06μs                | 6.45±0.6μs                           |    0.9  | bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 2, 1, 1, 'D')                 |
| -        | 503±8ns                    | 454±3ns                              |    0.9  | bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'multiply'>, 1, 1, 1, 'B')               |
| -        | 848±60ns                   | 743±9ns                              |    0.88 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'absolute'>, 1, 1, 'f')                    |
| -        | 861±30ns                   | 753±6ns                              |    0.87 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'ceil'>, 1, 1, 'f')                               |
| -        | 5.81±0.02μs                | 5.03±0.6μs                           |    0.86 | bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'multiply'>, 1, 2, 1, 'D')      |
| -        | 4.55±0.07μs                | 3.81±0.03μs                          |    0.84 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 4, 2, 'd')                        |
| -        | 4.54±0.1μs                 | 3.82±0.04μs                          |    0.84 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 4, 2, 'd')                 |
| -        | 4.51±0.03μs                | 3.69±0.04μs                          |    0.82 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 4, 1, 'd')                        |
| -        | 4.51±0.01μs                | 3.69±0.03μs                          |    0.82 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 4, 1, 'd')                 |
| -        | 3.22±0.01μs                | 2.12±0.02μs                          |    0.66 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'negative'>, 4, 2, 'f')                    |
| -        | 3.26±0.03μs                | 2.11±0.01μs                          |    0.65 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 4, 2, 'f')                           |

@charris
Copy link
Member

charris commented Aug 1, 2024

@seberg Are you OK with merging this as is?

@seberg
Copy link
Member Author

seberg commented Aug 1, 2024

Yes, of course, was just giving reviewers a chance to comment in case.

@charris charris merged commit 1082661 into numpy:main Aug 1, 2024
@charris
Copy link
Member

charris commented Aug 1, 2024

Thanks Sebastian.

@seberg seberg deleted the sane-simd-steps branch August 1, 2024 19:52
@seberg
Copy link
Member Author

seberg commented Aug 2, 2024

Hmmmpf, the test didn't fail here but does not on other CI runs. I guess machine capability fluctuations...

@r-devulap do you have time to have a look at fixing ldexp and frexp? Not sure what is best, aligning to use npyv_loadable_stride_##sfx/npyv_storable_stride_##sfx, or just adding the alignment check that some of the other macros already have here.
The choice of the max stride is also slightly different, although I think compatible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: array negation (i.e. minus sign) is wrong on 32 bit OS (likely related to SIMD on non memory-aligned array)

4 participants