SIMD: Replace raw SIMD of sin/cos with NPYV (universal intrinsics) #17587
Merged
mattip merged 3 commits into numpy:master on Dec 26, 2020
seiko2plus commented Oct 21, 2020
Qiyu8 reviewed Oct 26, 2020
seiko2plus commented Oct 27, 2020
seiko2plus commented Nov 17, 2020
ping @mattip
seiko2plus commented Dec 26, 2020
Suggested change
- ** $maxopt $werror baseline
+ ** $maxopt baseline
Remove treating warnings as errors after CI passes the tests.
Done. I used this policy temporarily during development to detect any warnings.
mattip reviewed Dec 26, 2020
Nice speedups. Is this for 32-bit float only or also for 64-bit? Edit: 32-bit only.
The new code improves the performance of non-contiguous memory access for the output array without any performance regression. On PPC64LE, performance increased by a factor of 2.0-3.0, and by a factor of 1.5-2.0 on aarch64.
This test should not be exclusive to AVX. This patch also extends the unary test to cover different sets of output strides.
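A strided unary check of the kind described above might look like the following. This is a minimal sketch, not the test added in the patch: the helper name `check_unary_strided`, the input range, the array size, and the tolerances are all assumptions chosen for illustration.

```python
import numpy as np


def check_unary_strided(ufunc, stride_in, stride_out, n=1024):
    """Run a float32 ufunc on strided input/output and compare to a float64 reference.

    Hypothetical helper for illustration; not part of NumPy's test suite.
    """
    rng = np.random.default_rng(0)
    src = rng.uniform(-100.0, 100.0, size=n * stride_in).astype(np.float32)
    # Fill the destination with NaN so untouched gap elements can be detected.
    dst = np.full(n * stride_out, np.nan, dtype=np.float32)

    x = src[::stride_in]      # non-contiguous input view when stride_in > 1
    out = dst[::stride_out]   # non-contiguous output view when stride_out > 1
    ufunc(x, out=out)

    # Reference computed in double precision, then rounded to float32.
    expected = ufunc(x.astype(np.float64)).astype(np.float32)
    np.testing.assert_allclose(out, expected, rtol=1e-5, atol=1e-6)

    # Elements between the output strides must not have been written.
    if stride_out > 1:
        gaps = np.ones(dst.shape, dtype=bool)
        gaps[::stride_out] = False
        assert np.isnan(dst[gaps]).all()
    return True


for ufunc in (np.sin, np.cos):
    for stride_in in (1, 2, 4):
        for stride_out in (1, 2, 4):
            check_unary_strided(ufunc, stride_in, stride_out)
```

Pre-filling the destination with NaN is one way to confirm that a vectorized scatter store writes only the strided positions.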
@mattip, just replaced the raw SIMD code of f32 with NPYV.
Thanks @seiko2plus
Merge after #17790, #17789
SIMD: Replace raw SIMD of sin/cos with NPYV
The new code improves the performance of non-contiguous memory access for the output array without any performance regression.
On PPC64LE, performance increased by a factor of 2.0-3.0, and by a factor of 1.5-2.0 on aarch64.
TODO:
Performance tests (ASV)
Args
X86
I had to rely on my local machine because I couldn't get stable ratios using AWS.
See the standalone benchmark for AVX512F.
CPU
OS
Benchmark
AVX2 & FMA3 - Changed only

| before [098a3b41] `<master>` | after [a0322ee9] `<to_npyv_sincos_f32>` | ratio | benchmark |
|---|---|---|---|
| 259~3us | 55.1~0.2us | 0.21 | bench_ufunc_strides.Unary.time_ufunc('cos', 1, 2, 'f') |
| 260~4us | 56.2~0.2us | 0.22 | bench_ufunc_strides.Unary.time_ufunc('cos', 1, 4, 'f') |
| 334~0.8us | 60.4~0.07us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('cos', 2, 2, 'f') |
| 335~0.9us | 61.5~0.2us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('cos', 2, 4, 'f') |
| 337~0.4us | 62.1~0.2us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('cos', 4, 2, 'f') |
| 339~2us | 61.2~0.6us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('cos', 4, 4, 'f') |
| 266~10us | 54.9~0.2us | 0.21 | bench_ufunc_strides.Unary.time_ufunc('sin', 1, 2, 'f') |
| 270~20us | 55.6~0.2us | 0.21 | bench_ufunc_strides.Unary.time_ufunc('sin', 1, 4, 'f') |
| 331~3us | 60.3~0.1us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('sin', 2, 2, 'f') |
| 332~2us | 61.0~0.3us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('sin', 2, 4, 'f') |
| 336~1us | 61.7~0.3us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('sin', 4, 2, 'f') |
| 335~0.2us | 61.5~0.4us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('sin', 4, 4, 'f') |

Power little-endian
CPU
OS
Benchmark
VSX2(ISA >= 2.07) - Changed only
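For readers without an ASV setup, the shape of a benchmark such as `bench_ufunc_strides.Unary.time_ufunc('cos', 1, 2, 'f')` can be approximated with `timeit`. The sketch below is an assumption about what the four parameters mean (ufunc name, input stride, output stride, dtype code); the function name `time_unary`, the array size, and the repeat counts are hypothetical and do not reproduce ASV's own timing methodology.

```python
import timeit

import numpy as np


def time_unary(ufunc_name, stride_in, stride_out, dtype="f",
               n=10000, repeat=5, number=100):
    """Return the best per-call time in seconds for a strided unary ufunc call.

    Hypothetical helper; parameter meanings are assumed from the benchmark names.
    """
    ufunc = getattr(np, ufunc_name)
    src = np.ones(n * stride_in, dtype=dtype)
    dst = np.ones(n * stride_out, dtype=dtype)
    x = src[::stride_in]       # strided input view
    out = dst[::stride_out]    # strided output view

    timings = timeit.repeat(lambda: ufunc(x, out=out),
                            repeat=repeat, number=number)
    # Use the minimum of the repeats to reduce the influence of system noise.
    return min(timings) / number


if __name__ == "__main__":
    for name in ("sin", "cos"):
        for s_in, s_out in ((1, 2), (1, 4), (2, 2), (4, 4)):
            seconds = time_unary(name, s_in, s_out)
            print(f"{name} stride_in={s_in} stride_out={s_out}: "
                  f"{seconds * 1e6:.1f} us")
```

Running this on builds before and after the change and dividing the two timings gives a ratio comparable in spirit to the `ratio` column, though absolute numbers will differ from ASV's.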
Performance tests (standalone #15987)
Args used within #15987
Note:
`--msleep 1` forces the running thread to sleep for 1 millisecond before collecting each sample, to revert any frequency reduction, since throttling seems to affect wall time when `AVX512F` is enabled.
X86
CPU
OS
Benchmark
AVX512F - Contiguous only
metric: gmean, units: ms
Values surviving extraction (row and column labels lost): 1.13, 1.07
AVX512F
metric: gmean, units: ms
[Table values were fused during page extraction; row and column labels are lost.]
AVX2 & FMA3 - Contiguous only
metric: gmean, units: ms
AVX2 & FMA3
metric: gmean, units: ms
[Table values were fused during page extraction; row and column labels are lost.]
ARM8 64-bit
CPU
OS
Benchmark
ASIMD - Contiguous only
metric: gmean, units: ms
Values surviving extraction (row and column labels lost): 1.93, 2.0, 2.11, 1.97, 2.03, 2.09
ASIMD
metric: gmean, units: ms
[Table values were fused during page extraction; row and column labels are lost.]
Power little-endian
CPU
OS
Benchmark
VSX2 (ISA >= 2.07) - Contiguous only
metric: gmean, units: ms
Values surviving extraction (row and column labels lost): 2.94, 2.99, 3.03, 3.16, 3.13, 3.2
VSX2 (ISA >= 2.07)
metric: gmean, units: ms
[Table values were fused during page extraction; row and column labels are lost.]
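The sampling approach described in the `--msleep` note, combined with the `gmean` metric reported in milliseconds, can be illustrated roughly as follows. This is only a sketch under stated assumptions: it is not the standalone tool from #15987, and the function names `sample_ms` and `gmean`, the sample count, and the array size are hypothetical.

```python
import math
import time

import numpy as np


def sample_ms(func, samples=20, msleep=1.0):
    """Collect wall-clock samples in milliseconds, sleeping before each one.

    The sleep is meant to give the CPU a chance to recover from any
    frequency reduction caused by the previous sample (assumed behaviour).
    """
    times = []
    for _ in range(samples):
        time.sleep(msleep / 1000.0)
        start = time.perf_counter()
        func()
        times.append((time.perf_counter() - start) * 1000.0)
    return times


def gmean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    x = np.ones(100000, dtype=np.float32)
    out = np.empty_like(x)
    samples = sample_ms(lambda: np.sin(x, out=out))
    print(f"gmean: {gmean(samples):.4f} ms over {len(samples)} samples")
```

The geometric mean is less sensitive to a few slow outlier samples than the arithmetic mean, which is one plausible reason to report it for wall-time measurements.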