ENH: Umath: Replace raw SIMD of unary floating point (32-64) with NPYV - g0 #16247
mattip merged 4 commits into numpy:master
Conversation
@seiko2plus this needs a redo now that the infrastructure is in place.

@seiko2plus Ping. Looks like the recent header change merges require a rebase. Any possibility of breaking out the fixes required for the 32 bit wheel builds?

@charris, I'm going to rebase it and push the local changes once I'm done with #16782.

@charris,

> Any possibility of breaking out the fixes required for the 32 bit wheel builds?

No, I have to implement a SIMD kernel that handles scalars as well as vectors; just give me a week.
The OpenCV SIMD sqrt is here, but you are right: if we don't check for zero, the result becomes a NaN.
Thanks to #16782, I caught two issues here: positive infinity (same as zero) and precision.
Unfortunately, I had to add a third Newton-Raphson iteration to provide acceptable precision.
Can you add a test for that failure?
I added a check for all sqrt special cases, see:
https://github.com/numpy/numpy/blob/1f872658984b2f8b0fda7022e72ad333a62864f3/numpy/core/tests/test_simd.py#L184-L188
I made another fix to npyv_sqrt_f32() in the last push, which fixes a floating-point division-by-zero error raised by vrsqrteq_f32(x) when x is zero.
Now all the tests pass on armhf.
This will fix the failing 32-bit wheel build?
This change activates the new dispatcher.
The 32-bit wheel build fails due to an aggressive optimization GCC makes that doesn't respect division by zero; this issue also existed on 64-bit when AVX2 and AVX512F aren't enabled.
The partial load intrinsics npyv_load_till_* and npyv_loadn_till_* guarantee filling the tail of the vector with "one", which avoids it.
I also made a slight change in the last push to use the NPYV version for overlapped arrays as well, to guarantee the same precision on armhf.
@r-devulap, I only enabled SSE and dropped AVX2 and AVX512F since there's no performance gain for contiguous arrays; also, the emulated versions of the partial and non-contiguous memory load/store intrinsics show better performance compared with the gather/scatter (AVX512F) intrinsics, especially when I unroll by x2/x4.
I am curious why the shift in order is needed; usually Python.h comes first.
To suppress the warning 'declaration of struct timespec*'; this compiler warning is raised when math.h gets included before Python.h.
Do you have some more info about this you can link to?
Do the new intrinsics need to be added to the _simd module explicitly or is it automatic?
Explicitly, please take a look at:
numpy/core/src/umath/loops_utils.h
We have solve_may_share_memory in common/mem_overlap.c; can we use this opportunity to refactor the code to use it? If not, then this function should be moved to that file.
Apparently solve_may_share_memory() is used to resolve array overlap at the Python level, while what we do here is avoid performing SIMD vector operations on overlapped arrays, since the user expects scalar-by-scalar overlap behavior, which acts differently from unrolling; also, we use scatter/gather operations, which may lead to undefined behaviour.

> If not, then this function should be moved to that file.

I don't think this function is related to the content of common/mem_overlap.c.
numpy/core/tests/test_simd.py
Nice, the power of _simd to show the intrinsics are correct is compelling.
It should be very helpful for the upcoming architectures.
I already provided good benchmarks in the description of this pull-request based on #15987, the last changes I made

IMO, a standalone benchmark script for universal intrinsics is the performance assurance from a technical perspective, but what @mattip cares about is the final performance that NumPy users can perceive, which should be measured by
@seiko2plus this now has a merge conflict
Yes, this is what we now need to verify these changes have not impacted x86_64 in a bad way. I would do it, but I don't seem to be able to stabilize my machine enough to get consistent results.

This patch also improves division precision for NEON/A32
- only covers sqrt, absolute, square and reciprocal
- fix SIMD memory overlap check for aliasing (same ptr & stride)
- unify fp/domain errors for both scalars and vectors
That's actually the exact reason behind #15987: it only compares the inner loops of the ufunc, which reduces the number of outliers.
Try to stabilize your system via `sudo python -m pyperf system tune`. You will need to compare against multiple dispatched targets; you're going to have to use the environment variable, for example:

```shell
# disable AVX512F to benchmark AVX2
export NPY_DISABLE_CPU_FEATURES="AVX512F"
# run asv or #15987

# disable AVX512F and AVX2 to benchmark SSE
export NPY_DISABLE_CPU_FEATURES="AVX2 AVX512F"
# run asv or #15987
```
Doesn't do much on an AMD machine

@hameerabbasi could you benchmark this?

I originally posted this on #15987 by mistake: I ran this PR on a live environment without a desktop (Ubuntu Server), using the method in the PR description. The noise was around 3% and this PR had a performance impact of ±5%, so not too much of a difference.
mattip left a comment:
Thanks @hameerabbasi. So it seems this is good to be merged: performance on x86_64 is unchanged, and this unlocks universal intrinsics for the other architectures (and should solve that pesky test failure on old gcc on 32-bit linux).
Thanks @seiko2plus
This pull-request:
- replaces the raw SIMD of sqrt, absolute, square, reciprocal with the NumPy C SIMD vectorization interface (NPYV)
- fixes the SIMD memory overlap check for aliasing (same ptr & stride)
- unifies fp/domain errors for both scalars and vectors, which leads to fixing the AVX test failures for 32 bit manylinux1 wheels #17174
- improves float32 division precision on NEON/A32
- adds new NPYV intrinsics: sqrt, abs, recip and square
- reorders Python.h to suppress the warning 'declaration of struct timespec*'

merge after #17340
closes #17174
TODO:
Performance tests

Args used within #15987
Note: --msleep 1 forces the running thread to sleep 1 millisecond before collecting each sample, to revert any frequency reduction, since it seems that throttling affects wall time when AVX512F is enabled.

X86
CPU
OS
Benchmark
AVX512F - Contiguous only (metric: gmean, units: ms)
[benchmark table omitted]
AVX512F (metric: gmean, units: ms)
[benchmark table omitted]
AVX2 - Contiguous only (metric: gmean, units: ms)
[benchmark table omitted]
AVX2 (metric: gmean, units: ms)
[benchmark table omitted]
SSE3 - Contiguous only (metric: gmean, units: ms)
[benchmark table omitted]
SSE3 (metric: gmean, units: ms)
[benchmark table omitted]

ARM8 64-bit
CPU
OS
Benchmark
ASIMD - Contiguous only (metric: gmean, units: ms)
[benchmark table omitted]
ASIMD (metric: gmean, units: ms)
[benchmark table omitted]

Power little-endian
CPU
OS
Benchmark
VSX2 (ISA >= 2.07) - Contiguous only (metric: gmean, units: ms)
[benchmark table omitted]
VSX2 (ISA >= 2.07) (metric: gmean, units: ms)
[benchmark table omitted]

Binary size of _multiarray_umath.cpython-ver-arch-linux-gnu.so in kbytes
Note: Debugging symbols are stripped.
EDIT: left some notes