-
-
Notifications
You must be signed in to change notification settings - Fork 11.8k
ENH: Modulate dispatched x86 CPU features #28896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e1ee011 to
1cd1a0e
Compare
|
needs a rebase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR reorganizes NumPy’s CPU build options by replacing individual x86 features with consolidated microarchitecture groups and bumps the baseline to x86-64-v2. Key changes include the introduction of X86_V2, X86_V3, and X86_V4 feature groups, updated enum values and CPU feature detection logic, and corresponding updates in tests and build configurations.
Reviewed Changes
Copilot reviewed 10 out of 19 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| numpy/_core/tests/test_cpu_features.py | Updates to feature groups and test definitions reflecting the new model. |
| numpy/_core/tests/test_cpu_dispatcher.py | Adjusts dispatcher tests to use new group names. |
| numpy/_core/src/common/npy_cpu_features.h | Revises enum definitions to include new CPU groups. |
| numpy/_core/src/common/npy_cpu_features.c | Updates CPU feature detection logic and mappings. |
| meson_cpu/x86/test_x86_v[2-4].c | Introduces new tests for the respective microarchitecture groups. |
| doc/source/reference/simd/gen_features.py | Removes deprecated code generation for CPU features documentation. |
| .github/workflows/linux_simd.yml | Modifies build flags and cpu-dispatch settings to integrate new groups. |
Files not reviewed (9)
- doc/release/upcoming_changes/28896.change.rst: Language not supported
- doc/source/reference/simd/generated_tables/compilers-diff.inc: Language not supported
- doc/source/reference/simd/generated_tables/cpu_features.inc: Language not supported
- doc/source/reference/simd/log_example.txt: Language not supported
- meson.options: Language not supported
- meson_cpu/meson.build: Language not supported
- meson_cpu/x86/meson.build: Language not supported
- numpy/_core/meson.build: Language not supported
- numpy/_core/src/umath/loops_autovec.dispatch.c.src: Language not supported
Comments suppressed due to low confidence (1)
numpy/_core/src/common/npy_cpu_features.c:515
- Review the modified condition for AVX512 OS support; verify that incorporating the avx_os check correctly handles systems without AVX OS support without unintended side effects.
if (!avx512_os && avx_os) {
1cd1a0e to
b959fb3
Compare
b959fb3 to
729d61c
Compare
|
If I am understanding
correctly, than the limitation on older machines depends on how the binaries are built? For any of the downstream packagers (linux distros, conda-forge, etc) that is their responsibilities to sort out. For wheels I think there is a case to go more aggressively newer and push users towards better packaging ecosystems if they need to support older chips. |
|
You're right - we're bumping the default |
729d61c to
c41541a
Compare
|
@tacaswell, I've updated the release note to be more clear about the CPU baseline change. |
seiko2plus
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Three cases require changes; the rest are backport notes and comments that may need your decision.
| are not supported by the target CPU (raises Python runtime error). | ||
| - ``cpu-baseline``: The minimum set of CPU features required to run the compiled NumPy. | ||
|
|
||
| * Default: ``min`` (provides compatibility across a wide range of platforms) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * Default: ``min`` (provides compatibility across a wide range of platforms) | |
| * Default: ``min`` (provides compatibility across a wide range of platforms), see :ref:`special options <opt-special-options>` to check which min maps to for each architecture. |
provides a ref for option min
| else | ||
| filterd = [] | ||
| # filter out the features that are in the accumulate list | ||
| # including any successor features |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not excluding successor features should be considered a bug and needs to be back-ported.
For example, with cpu-dispatch before this patch, to exclude AVX512 you had to exclude all successor features, otherwise it would not be disabled:
python -m build --wheel -Csetup-args=-Dcpu-dispatch="max -avx512f -avx512cd \
-avx512_knl -avx512_knm -avx512_skx -avx512_clx -avx512_cnl -avx512_icl -avx512_spr"
After this fix:
python -m build --wheel -Csetup-args=-Dcpu-dispatch="max -avx512f"
| test_code: files(source_root + '/numpy/distutils/checks/cpu_avx512_cnl.c')[0] | ||
| ) | ||
| HWY_SSE4_FLAGS = ['-DHWY_WANT_SSE4', '-DHWY_DISABLE_PCLMUL_AES'] | ||
| # Use SSE for floating-point on x86-32 to ensure numeric consistency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fix needs to be backported too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be backported in #30026, and yes, I wonder if you have encountered such ambiguous runtime floating-point errors on 32-bit builds?
| test_code: files(current_dir + '/test_x86_v4.c')[0], | ||
| ) | ||
| if cpu_family == 'x86' | ||
| X86_V4.update(disable: 'not supported on x86-32') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This disable needs to be backported too. We should not generate AVX512 kernels on 32-bit systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reconsidered — it's not necessary. Maybe we should avoid behavior changes, since the major changes of moving Highway should land in newer releases so this PR was focused on reducing any extra burdens before such a move, since dealing with a large SIMD framework can be challenging. However, end users can still exclude them from --cpu-dispatch.
meson_cpu/x86/meson.build
Outdated
| AVX2.update(args: {'val': '/arch:AVX2', 'match': clear_arch}) | ||
| AVX512_SKX.update(args: {'val': '/arch:AVX512', 'match': clear_arch}) | ||
| X86_V3.update(args: {'val': '/arch:AVX2', 'match': clear_arch}) | ||
| X86_V4.update(args: {'val': '/arch:AVX512', 'match': clear_arch}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, Highway considers AVX512 a broken platform on MSVC. We've managed to support AVX512 on MSVC through universal intrinsics. We have three options here:
-
Keep it as-is: This leads to building AVX2 kernels with
/arch:AVX512, which gives the compiler the opportunity to optimize (hopefully). Worst case: we'll have duplicate kernels. -
Disable AVX512 support entirely:
| X86_V4.update(args: {'val': '/arch:AVX512', 'match': clear_arch}) | |
| X86_V4.update(disable: 'not supported by Highway') |
3.Change the default behavior and force-enable AVX512 on MSVC and lets see how deep the rabbit-hole goes:
| X86_V4.update(args: {'val': '/arch:AVX512', 'match': clear_arch}) | |
| X86_V4.update(args: [{'val': '/arch:AVX512', 'match': clear_arch}, '-DHWY_BROKEN_MSVC=0']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd prefer (2) or (3). If it's really very broken, disabling it seems fine (and low-effort). If it's feasible and there is energy for this, fixing it up in Highway over time seems nice.
(1) seems worst, if it doesn't do much, I'd much rather take the binary size gains from disabling it completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still haven't converted many kernels yet from universal intrinsics to Highway, so adding extra burden isn't a wise decision. Let's go for option (2) for now and investigate later which MSVC versions are compatible with Highway.
r-devulap
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completed an initial pass; detailed review still pending.
| __m512i avx512dq_i64_a = _mm512_set1_epi64(seed); | ||
| __m512i avx512dq_i64_b = _mm512_set1_epi64(2); | ||
| __m512i avx512dq_i64_c = _mm512_add_epi64(avx512dq_i64_a, avx512dq_i64_b); | ||
| result += _mm_cvtsi128_si32(_mm512_extracti32x4_epi32(avx512dq_i64_c, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seem unnecessarily too long. We should be able to keep this down to 5 or 6 instructions., something that tests one intrinsic from each feature:
// Test one intrinsic from each extension
// AVX-512F (Foundation)
__m512 avx512f_val = _mm512_set1_ps(1.0f);
// AVX-512CD (Conflict Detection)
__m512i avx512cd_val = _mm512_conflict_epi32(_mm512_set1_epi32(1));
// AVX-512VL (Vector Length Extensions)
__m256i avx512vl_val = _mm256_mask_mov_epi32(_mm256_set1_epi32(0), 0xFF, _mm256_set1_epi32(1));
// AVX-512BW (Byte and Word)
__m512i avx512bw_val = _mm512_set1_epi8(1);
// AVX-512DQ (Doubleword and Quadword)
__m512d avx512dq_val = _mm512_set1_pd(1.0);
// Prevent unused variable warnings
(void)avx512f_val;
(void)avx512cd_val;
(void)avx512vl_val;
(void)avx512bw_val;
(void)avx512dq_val;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copilot is an easy way to generate these :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each line ensures both the compiler and assembler fully support all AVX-512 feature groups. Mask operations make the test longer, but they’re now required: Highway by default falls back to AVX2 if full AVX-512 isn’t fully supported they marks these compilers as broken. Before this patch, partial AVX-512/BW/DQ support is possible as universal intrinics has special workarounds for these cases you can follow AVX512BW_MASK and AVX512DQ_MASK for more details. However this compile-time test drop this support, and I have left a release note for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason the test is longer is the inclusion of shuffle and scalar extraction operations, which are important to prevent the compiler from optimizing them out and to ensure proper assembler support.
| __m128 f16c_src = _mm_set1_ps((float)seed); | ||
| __m128i f16c_half = _mm_cvtps_ph(f16c_src, 0); | ||
| __m128 f16c_restored = _mm_cvtph_ps(f16c_half); | ||
| result += (int)_mm_cvtss_f32(f16c_restored); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as below, can we keep this simple please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ve stuck to a safe approach, consistent with our current tests, which guards against compiler optimizations to ensure the assembler sees the emitted instructions and that a unique intrinsic from each SIMD extension is tested. By 2025, can we assume all compilers and assemblers fully support V3, so we could just test the compiler definitions instead?
| - Includes | ||
| * - ``X86_V2`` | ||
| - | ||
| - ``SSE`` ``SSE2`` ``SSE3`` ``SSSE3`` ``SSE4_1`` ``SSE4_2`` ``POPCNT`` ``CX16`` ``LAHF`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume X86_V2 is the SSE4 highway equivalent and from Highway doc SSE4 includes AES and CLMUL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, Highway’s SSE4 isn’t compatible with X86_V2 (Microarchitecture Level 2). We need to pass -DHWY_DISABLE_PCLMUL_AES to enable it.
| ) | ||
| X86_V3 = mod_features.new( | ||
| 'X86_V3', 10, implies: X86_V2, | ||
| args: ['-mavx', '-mavx2', '-mfma', '-mbmi', '-mbmi2', '-mlzcnt', '-mf16c', '-mmovbe'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way X86_V2 is defined, X86_V3 will not enabled all the required targets for Highway AVX2 (which requires ['-mavx2', '-maes', '-mpclmul', '-mbmi', '-mbmi2']. Or is that not right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding -DHWY_DISABLE_PCLMUL_AES forces Highway to ignore detection of the AES and PCLMUL crypto features.
Please follow meson var HWY_SSE4_FLAGS:
numpy/meson_cpu/x86/meson.build
Lines 6 to 15 in 37966c6
| HWY_SSE4_FLAGS = ['-DHWY_WANT_SSE4', '-DHWY_DISABLE_PCLMUL_AES'] | |
| # Use SSE for floating-point on x86-32 to ensure numeric consistency. | |
| # The x87 FPU's 80-bit internal precision causes unpredictable rounding | |
| # and overflow behavior when converting to smaller types. SSE maintains | |
| # strict 32/64-bit precision throughout all calculations. | |
| X86_64_V2_FLAGS = cpu_family == 'x86'? ['-mfpmath=sse'] : ['-mcx16'] | |
| X86_64_V2_NAMES = cpu_family == 'x86'? [] : ['CX16'] | |
| X86_V2 = mod_features.new( | |
| 'X86_V2', 1, args: ['-msse', '-msse2', '-msse3', '-mssse3', '-msse4.1', '-msse4.2', | |
| '-mpopcnt', '-msahf'] + X86_64_V2_FLAGS + HWY_SSE4_FLAGS, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree HWY_DISABLE_PCLMUL_AES is sufficient for avoiding the need for CLMUL and AES. If defined, CLMul* and AESRound must not be called, which you don't, so all good.
999c292 to
580205a
Compare
…-64-v2` baseline Prevents "invalid value encountered in left_shift" warnings on clang-cl when testing bit shifts with long types under `x86-64-v2` baseline.
Force SSE-based floating-point on 32-bit x86 systems to fix inconsistent results between einsum and other math functions. Prevents test failures with int16 operations by avoiding the x87 FPU's extended precision.
- Disable X86_V4 (AVX-512) for MSVC builds due to Highway incompatibility - Add FIXME comment for future MSVC compatibility investigation - Update SIMD CI workflow to reflect `x86-64-v2` baseline - Remove redundant test configurations - Add missing X86_V4 support to unary complex loops
580205a to
37966c6
Compare
r-devulap
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
The failures are triggered when the Intel x86 sort AVX‑512 kernels for 16‑bit are enabled at build time and the CPU/OS also supports them. A quick look at the emulated function suggests that the compiler does not correctly generate the instructions for `zmm_vector<float16>::ge(reg_t, reg_t)`. This patch does not actually fix the underlying bug; instead, it disables these kernels on 32‑bit MSVC builds as a stop‑gap, since the issue requires further investigation and an upstream fix. Note: Newer NumPy releases will drop the entire AVX‑512 support on 32‑bit for all compilers and will enable at most AVX2 as part of numpygh-28896
The failures are triggered when the Intel x86 sort AVX‑512 kernels for 16‑bit are enabled at build time and the CPU/OS also supports them. A quick look at the emulated function suggests that the compiler does not correctly generate the instructions for `zmm_vector<float16>::ge(reg_t, reg_t)`. This patch does not actually fix the underlying bug; instead, it disables these kernels on 32‑bit MSVC builds as a stop‑gap, since the issue requires further investigation and an upstream fix. Note: Newer NumPy releases may drop the entire AVX‑512 support on 32‑bit for all compilers and will enable at most AVX2 as part of numpygh-28896
The failures are triggered when the Intel x86 sort AVX‑512 kernels for 16‑bit are enabled at build time and the CPU/OS also supports them. A quick look at the `zmm_vector<float16>::ge(reg_t, reg_t)` seems to not correctly generate the instructions for it. This patch does not actually fix the underlying bug; instead, it disables these kernels on 32‑bit MSVC builds as a stop‑gap, since the issue requires further investigation and an upstream fix. Note: Newer NumPy releases may drop the entire AVX‑512 support on 32‑bit for all compilers and will enable at most AVX2 as part of numpygh-28896
The failures are triggered when the Intel x86 sort AVX‑512 kernels for 16‑bit are enabled at build time and the CPU/OS also supports them. A quick look at the `zmm_vector<float16>::ge(reg_t, reg_t)` seems to not correctly generate the instructions for it. This patch does not actually fix the underlying bug; instead, it disables these kernels on 32‑bit MSVC builds as a stop‑gap, since the issue requires further investigation and an upstream fix. Note: Newer NumPy releases may drop the entire AVX‑512 support on 32‑bit for all compilers and will enable at most AVX2 as part of gh-28896
The failures are triggered when the Intel x86 sort AVX‑512 kernels for 16‑bit are enabled at build time and the CPU/OS also supports them. A quick look at the `zmm_vector<float16>::ge(reg_t, reg_t)` seems to not correctly generate the instructions for it. This patch does not actually fix the underlying bug; instead, it disables these kernels on 32‑bit MSVC builds as a stop‑gap, since the issue requires further investigation and an upstream fix. Note: Newer NumPy releases may drop the entire AVX‑512 support on 32‑bit for all compilers and will enable at most AVX2 as part of numpygh-28896
The failures are triggered when the Intel x86 sort AVX‑512 kernels for 16‑bit are enabled at build time and the CPU/OS also supports them. A quick look at the `zmm_vector<float16>::ge(reg_t, reg_t)` seems to not correctly generate the instructions for it. This patch does not actually fix the underlying bug; instead, it disables these kernels on 32‑bit MSVC builds as a stop‑gap, since the issue requires further investigation and an upstream fix. Note: Newer NumPy releases may drop the entire AVX‑512 support on 32‑bit for all compilers and will enable at most AVX2 as part of numpygh-28896
h-vetinari
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are comments about fixes needing to be backported. Has this happened?
| test_code: files(source_root + '/numpy/distutils/checks/cpu_avx512_cnl.c')[0] | ||
| ) | ||
| HWY_SSE4_FLAGS = ['-DHWY_WANT_SSE4', '-DHWY_DISABLE_PCLMUL_AES'] | ||
| # Use SSE for floating-point on x86-32 to ensure numeric consistency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here
| test_code: files(current_dir + '/test_x86_v4.c')[0], | ||
| ) | ||
| if cpu_family == 'x86' | ||
| X86_V4.update(disable: 'not supported on x86-32') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
|
@h-vetinari AVX512 was disabled on 32 bit windows in #29910. @seiko2plus thought that was sufficient. |
…atures Two major bugs need to be merged as part of numpygh-28896 and must be fixed: - Force SSE-based floating-point on 32-bit x86 systems to avoid x87 FPU's 80-bit internal precision causing unpredictable rounding and overflow behavior when converting to smaller types. - Fix filtering out successor features when an implied feature is removed through build options `--cpu-baseline` and `--cpu-dispatch`. Documentation `build-options` has been updated for this case.
…atures Two major bugs need to be merged as part of numpygh-28896 and must be fixed: - Force SSE-based floating-point on 32-bit x86 systems to avoid x87 FPU's 80-bit internal precision causing unpredictable rounding and overflow behavior when converting to smaller types. - Fix filtering out successor features when an implied feature is filterd out from `max` within build options `--cpu-baseline` and `--cpu-dispatch`. Documentation `build-options` has been updated for this case. - Updates `build-options` to add missing explanation about option `detect`.
Not yet, there are only two bugs that have been found; they should be back-ported (#30026).
Regarding dropping
This only disables sort operations and only on |
|
Just a quick note: we really don't care about performance on 32-bit Windows, we don't even bundle OpenBLAS so linalg operations are very slow, and we don't get any complaints. It just needs to work with minimal maintenance overhead. |
…atures Two major bugs need to be merged as part of numpygh-28896 and must be fixed: - Force SSE-based floating-point on 32-bit x86 systems to avoid x87 FPU's 80-bit internal precision causing unpredictable rounding and overflow behavior when converting to smaller types. - Fix filtering out successor features when an implied feature is filterd out from `max` within build options `--cpu-baseline` and `--cpu-dispatch`. Documentation `build-options` has been updated for this case. - Updates `build-options` to add missing explanation about option `detect`.
The failures are triggered when the Intel x86 sort AVX‑512 kernels for 16‑bit are enabled at build time and the CPU/OS also supports them. A quick look at the `zmm_vector<float16>::ge(reg_t, reg_t)` seems to not correctly generate the instructions for it. This patch does not actually fix the underlying bug; instead, it disables these kernels on 32‑bit MSVC builds as a stop‑gap, since the issue requires further investigation and an upstream fix. Note: Newer NumPy releases may drop the entire AVX‑512 support on 32‑bit for all compilers and will enable at most AVX2 as part of numpygh-28896
Overview
This PR reorganizes NumPy's CPU build options by replacing individual x86 features with microarchitecture levels. This change aligns with the Google Highway project requirements and common Linux distribution practices.
This PR default setting for
cpu-baselineon x86 has been raised tox86-64-v2microarchitecture as we're in 2025 and adding SIMD compatibility for antiquated CPUs from before 2009 is no longer practical or efficient.This can be changed to
cpu-baseline=noneduring build time to support older CPUs, though manual SIMD optimizations for pre-2009 processors are no longer supported. This change improves performance and reduces binarysize while only affecting hardware that is over 15 years old.
Key Changes
X86_V2,x86_V3, andx86_V4groupsx86_V2), covering features from CPUs since 2009.This improves performance and reduces binary size
-Operator: Corrected to properly exclude successor featuresDetailed CPU Feature Changes
SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,POPCNT) → now inX86_V2XOP,FMA4)AVX512_KNL,AVX512_KNM)AVX,AVX2,FMA3,F16C) → now inX86_V3AVX512F,AVX512CD(from dropping Xeon Phi support)AVX512_SKXtoX86_V4AVX512_CLXandAVX512_CNLAVX512_ICLto includeVAES,GFNI,VPCLMULQDQNew Feature Group Hierarchy
CPU Generation Mapping
X86_V2: x86-64-v2 microarchitectures (CPUs since 2009)X86_V3: x86-64-v3 microarchitectures (CPUs since 2015)X86_V4: x86-64-v4 microarchitectures (AVX-512 capable CPUs)AVX512_ICL: Intel Ice Lake and similar CPUsAVX512_SPR: Intel Sapphire Rapids and newer CPUsDocumentation
Documentation has been updated to reflect these changes and to fit the current meson build system.
closes #27851