BUG: Fix simd loadable stride logic #27027
Conversation
It was never correct that strides are divisable by their itemsize and *some* code even checked for it correctly, while others had asserts in place that had no reason why they should be guaranteed to pass. This moves the check into the `loadable_stride` macro. The exp implementation has it's own custom logic, that is probably OK. I assume that compilers will optimize the duplicate division away, if that may not be the case, the code should be optimized to avoid it probably.
|
I have confirmed that this fixes the issue in the i386 docker env. It should be carefully reviewed anyway, since I am not sure how good the tests really are. The tests at gh-27028 crashed (presumably hitting one of the deleted asserts). EDIT: After removing the asserts, they failed beautifully. |
|
Does this need a backport? There are a couple of other SIMD fixes... |
|
It's a huge bug, but I dunno, since it is also a larger change, and I think around for a while... |
nice |
The code looks good to me. Could you benchmark a relevant ufunc to see if there is any impact? |
You mean here? Perhaps, we should use the same logic everywhere for consistency? |
| return MAXLOAD > 0 ? llabs(stride) <= MAXLOAD : 1; \ | ||
| } \ | ||
| NPY_FINLINE int \ | ||
| npyv_storable_stride_##SFX(npy_intp stride) \ |
There was a problem hiding this comment.
Wondering if these can fold into one function. It duplicates the same logic.
There was a problem hiding this comment.
I can fold it into a single macro. Although, can MAXLOAD and MAXSTORE be reasonably different? Maybe it should just be NPY_SIMD_MAX_STRIDE?
There was a problem hiding this comment.
As of now, only avx512 defines MAXSTORE and it is equal to MAXLOAD. Let's fold into a single value. There was no magic way to calculate them, IIRC it was a simple heuristic.
There was a problem hiding this comment.
Actually, AVX2 defines only the load version. I might suspect that it is quite silly to not just use that for the store either way (also stores are much less likely to have huge strides).
But, it makes me want to not include the change here.
|
Not sure there is a test that really might notice it (unless we disable simd accidentally), but I see these timings. That is huge fluctuations, but I can't say any of them look systematic (locally on mac). The only one that seems a bit plausible might be the 3rd one: Details |
|
@seberg Are you OK with merging this as is? |
|
Yes, of course, was just giving reviewers a chance to comment in case. |
|
Thanks Sebastian. |
|
Hmmmpf, the test didn't fail here but does not on other CI runs. I guess machine capability fluctuations... @r-devulap do you have time to have a look at fixing |
It was never correct that strides are divisable by their itemsize and
some code even checked for it correctly, while others had asserts
in place that had no reason why they should be guaranteed to pass.
This moves the check into the
loadable_stridemacro. The expimplementation has it's own custom logic, that is probably OK.
I assume that compilers will optimize the duplicate division away,
if that may not be the case, the code should be optimized to avoid
it probably.
(although, I guess these are power of 2 divisions so they don't matter actually)
Closes gh-26775