Skip to content

detect intrinsics: ensure intrinsics are not optimized out before assembler is run#1562

Merged
Dead2 merged 4 commits intozlib-ng:developfrom
haampie:fix/avx-detection-with-old-assembler
Sep 13, 2023
Merged

detect intrinsics: ensure intrinsics are not optimized out before assembler is run#1562
Dead2 merged 4 commits intozlib-ng:developfrom
haampie:fix/avx-detection-with-old-assembler

Conversation

@haampie
Copy link
Copy Markdown
Contributor

@haampie haampie commented Aug 21, 2023

Closes #1559

Old binutils may have an assembler that does not support vector instructions
generated by gcc.

Although it's recommended to use an up-to-date assembler with gcc, it's happens
in practice that new gcc and old binutils are mixed.

The problem with the current configure script is that it runs tests with -O2
cflags, which causes the test inside main() { ... } to be optimized out
entirely, and the assembler won't see any of the vector instructions, leading
to false positives in tests.

This PR ensures that most configure tests use a separate function, taking and
returning some __mm* type, to avoid optimizations, so that the assembler will
error on seeing unsupported instructions.

Alternatively, -O0 could be passed, but I don't think that is always
sufficient, since some compilers cannot not optimize ;p I believe nvhpc is
one of those.

@codecov
Copy link
Copy Markdown

codecov bot commented Aug 21, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (0cb2b37) 83.87% compared to head (1605ec8) 83.88%.
Report is 13 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #1562   +/-   ##
========================================
  Coverage    83.87%   83.88%           
========================================
  Files          132      132           
  Lines        10843    10843           
  Branches      2801     2801           
========================================
+ Hits          9095     9096    +1     
+ Misses        1049     1048    -1     
  Partials       699      699           
Flag Coverage Δ
macos_clang 42.97% <ø> (ø)
macos_gcc 73.61% <ø> (ø)
ubuntu_clang 82.50% <ø> (ø)
ubuntu_clang_debug 81.90% <ø> (-0.24%) ⬇️
ubuntu_clang_inflate_allow_invalid_dist 82.15% <ø> (ø)
ubuntu_clang_inflate_strict 82.49% <ø> (ø)
ubuntu_clang_mmap 82.82% <ø> (ø)
ubuntu_clang_pigz 13.96% <ø> (ø)
ubuntu_clang_pigz_no_optim 11.51% <ø> (ø)
ubuntu_clang_pigz_no_threads 13.73% <ø> (-0.06%) ⬇️
ubuntu_clang_reduced_mem 82.67% <ø> (-0.23%) ⬇️
ubuntu_clang_toolchain_riscv ∅ <ø> (∅)
ubuntu_gcc 75.30% <ø> (+0.07%) ⬆️
ubuntu_gcc_aarch64 77.40% <ø> (ø)
ubuntu_gcc_aarch64_compat_no_opt 75.65% <ø> (ø)
ubuntu_gcc_aarch64_no_acle 76.16% <ø> (ø)
ubuntu_gcc_aarch64_no_neon 76.16% <ø> (ø)
ubuntu_gcc_armhf 77.47% <ø> (ø)
ubuntu_gcc_armhf_compat_no_opt 75.60% <ø> (ø)
ubuntu_gcc_armhf_no_acle 77.42% <ø> (ø)
ubuntu_gcc_armhf_no_neon 77.32% <ø> (ø)
ubuntu_gcc_armsf 74.65% <ø> (ø)
ubuntu_gcc_armsf_compat_no_opt 74.09% <ø> (ø)
ubuntu_gcc_benchmark 73.57% <ø> (+0.16%) ⬆️
ubuntu_gcc_compat_no_opt 76.85% <ø> (+0.01%) ⬆️
ubuntu_gcc_compat_sprefix 73.73% <ø> (ø)
ubuntu_gcc_m32 73.23% <ø> (-0.16%) ⬇️
ubuntu_gcc_mingw_i686 73.50% <ø> (ø)
ubuntu_gcc_mingw_x86_64 73.51% <ø> (ø)
ubuntu_gcc_mips 74.97% <ø> (ø)
ubuntu_gcc_mips64 74.98% <ø> (ø)
ubuntu_gcc_no_avx2 74.34% <ø> (+0.05%) ⬆️
ubuntu_gcc_no_ctz 74.65% <ø> (ø)
ubuntu_gcc_no_ctzll 74.64% <ø> (ø)
ubuntu_gcc_no_pclmulqdq 74.28% <ø> (ø)
ubuntu_gcc_no_sse2 74.54% <ø> (ø)
ubuntu_gcc_no_sse42 74.17% <ø> (-0.56%) ⬇️
ubuntu_gcc_o1 74.18% <ø> (+0.05%) ⬆️
ubuntu_gcc_osb ∅ <ø> (∅)
ubuntu_gcc_pigz 38.23% <ø> (+0.07%) ⬆️
ubuntu_gcc_pigz_aarch64 39.02% <ø> (-0.03%) ⬇️
ubuntu_gcc_ppc 73.92% <ø> (ø)
ubuntu_gcc_ppc64 74.36% <ø> (ø)
ubuntu_gcc_ppc64_power9 74.53% <ø> (ø)
ubuntu_gcc_ppc64le 74.43% <ø> (ø)
ubuntu_gcc_ppc64le_novsx 74.75% <ø> (ø)
ubuntu_gcc_ppc64le_power9 74.31% <ø> (ø)
ubuntu_gcc_ppc_no_power8 74.63% <ø> (ø)
ubuntu_gcc_s390x 74.80% <ø> (ø)
ubuntu_gcc_s390x_dfltcc 71.92% <ø> (ø)
ubuntu_gcc_s390x_dfltcc_compat 73.98% <ø> (ø)
ubuntu_gcc_s390x_no_crc32 74.59% <ø> (ø)
ubuntu_gcc_sparc64 74.79% <ø> (ø)
ubuntu_gcc_sprefix 73.24% <ø> (ø)
win64_gcc 73.95% <ø> (-0.05%) ⬇️
win64_gcc_compat_no_opt 74.71% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

see 3 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@phprus
Copy link
Copy Markdown
Contributor

phprus commented Aug 21, 2023

Are checks in CMake (https://github.com/zlib-ng/zlib-ng/blob/develop/cmake/detect-intrinsics.cmake) also affected?

@haampie
Copy link
Copy Markdown
Contributor Author

haampie commented Aug 21, 2023

Looks like the CMake version does not need patching, will check why that is. Probably they don't set optimization flags during tests.

Edit: yeah, that is the case. I think it would be good anyways to keep the tests in sync? And given that certain (hardware vendor) compilers like to optimize by default, I think that's generally an improvement.

(void)c;
return 0;
}"
#include <wmmintrin.h>
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This include was added in configure but missing in cmake? 🤔

@haampie
Copy link
Copy Markdown
Contributor Author

haampie commented Aug 21, 2023

Here's another fun false positive issue, which suggests to run tests with optimizations enabled

From https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64554.

Using GCC 12, this:

// test-1.c
#include <wmmintrin.h>
int main() { 
    __m128i a, b;
    _mm_clmulepi64_si128(a, b, 0x10);
}

compiles fine without any flags:

$ gcc test-1.c

but when you turn optimization flags on it errors:

$ gcc -O1 test-1.c
In file included from test-1.c:1:
/usr/lib/gcc/x86_64-linux-gnu/12/include/wmmintrin.h: In function ‘main’:
/usr/lib/gcc/x86_64-linux-gnu/12/include/wmmintrin.h:116:1: error: inlining failed in call to ‘always_inline’ ‘_mm_clmulepi64_si128’: target specific option mismatch
  116 | _mm_clmulepi64_si128 (__m128i __X, __m128i __Y, const int __I)
      | ^~~~~~~~~~~~~~~~~~~~
test-1.c:4:5: note: called from here
    4 |     _mm_clmulepi64_si128(a, b, 0x10);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On the other hand, when defined outside of main:

// test-2.c
#include <wmmintrin.h>
__m128i f(__m128i a, __m128i b) { return _mm_clmulepi64_si128(a, b, 0x10); }
int main() { return 0; }

it fails with and without optimization flags:

$ gcc test-2.c
In file included from test-2.c:1:
test-2.c: In function ‘f’:
test-2.c:2:42: error: ‘__builtin_ia32_pclmulqdq128’ needs isa option -mpclmul -msse2
    2 | __m128i f(__m128i a, __m128i b) { return _mm_clmulepi64_si128(a, b, 0x10); }
      |     

Edit: the configure / cmake test was correct, the difference is storing the result in __m128i c = ...

@haampie haampie changed the title configure: ensure instructions are not optimized out before assembler is run ensure intrinsics are not optimized out before assembler is run Aug 21, 2023
@haampie haampie changed the title ensure intrinsics are not optimized out before assembler is run detect intrinsics: ensure intrinsics are not optimized out before assembler is run Aug 21, 2023
@jrwrigh
Copy link
Copy Markdown

jrwrigh commented Aug 21, 2023

I've confirmed that this addresses the issues reported in #1559

@Dead2 Dead2 requested a review from KungFuJesus August 23, 2023 18:42
@Dead2
Copy link
Copy Markdown
Member

Dead2 commented Aug 27, 2023

LGTM, but I am awaiting code review by others.

Copy link
Copy Markdown
Member

@nmoinvaz nmoinvaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this PR locally against MSVC and it detected AVX2, AVX512, SSE42, SSE2, SSSE3, PCLMULQDQ, VPCLMULQDQ, and XSAVE intrinsics.

I'm surprised about the compiler not ignoring f() function since it is not used anywhere. I'm not sure if those changes are okay, because we have check_c_source_compile_or_run. The instructions should be executed when WITH_NATIVE_INSTRUCTIONS=ON.

@haampie
Copy link
Copy Markdown
Contributor Author

haampie commented Aug 29, 2023

I'm surprised about the compiler not ignoring f() function since it is not used anywhere.

It's because it's not static f(), so C has to create a symbol for it. I think it only gets deleted with -flto and -fwhole-program (the latter is GCC specific, effectively the same as adding static to every function in the compilation unit).

@phprus
Copy link
Copy Markdown
Contributor

phprus commented Aug 29, 2023

It's because it's not static f(), so C has to create a symbol for it. I think it only gets deleted with -flto and -fwhole-program (the latter is GCC specific, effectively the same as adding static to every function in the compilation unit).

Does building with CMake option -DCMAKE_C_VISIBILITY_PRESET=hidden work?

@haampie
Copy link
Copy Markdown
Contributor Author

haampie commented Aug 29, 2023

It's because it's not static f(), so C has to create a symbol for it. I think it only gets deleted with -flto and -fwhole-program (the latter is GCC specific, effectively the same as adding static to every function in the compilation unit).

Does building with CMake option -DCMAKE_C_VISIBILITY_PRESET=hidden work?

I think that only marks the symbol as such (for shared linking), the symbol is still added to the binary.

Another way to symbols get removed is -O1 -ffunction-sections -Wl,--gc-sections (-O1 to make sure that function get inlined, so that -ffunction-sections -Wl,--gc-sections can remove function that aren't called), but that shouldn't matter for this PR either since the compiler + assembler still create an object file before the linker removes stuff from it.

@haampie
Copy link
Copy Markdown
Contributor Author

haampie commented Aug 29, 2023

Regarding native instructions, why don't you trust -march=native to do just that? Why run executables?

@nmoinvaz
Copy link
Copy Markdown
Member

That is a good question, I don't know. Perhaps @Dead2 or @mtl1979 can chime in.

@haampie
Copy link
Copy Markdown
Contributor Author

haampie commented Aug 29, 2023

Asking cause it would simplify my PR 😆

@mtl1979
Copy link
Copy Markdown
Collaborator

mtl1979 commented Aug 29, 2023

There are "processors" that implement certain instructions incorrectly and running the executable will make sure we catch any crash issues.

@haampie
Copy link
Copy Markdown
Contributor Author

haampie commented Aug 29, 2023

Aren't tests better at catching that issue?

@Dead2
Copy link
Copy Markdown
Member

Dead2 commented Aug 30, 2023

Regarding native instructions, why don't you trust -march=native to do just that? Why run executables?

That is a very good question. Not running them would be faster and simpler.
I honestly can't remember why we started running the instruction set tests, and I suspect it is not really needed. The compiler should be correct, so running it would only be making doubly sure.

Potentially this could be used to compile only the optimized functions that actually run on the current cpu when using a compiler that does not support -march=native or similar, but we don't do that currently. Is there any interest in this? Think MSVC for example. The next step after that would be to only compile in the best optimization for the machine. But it'd require a good chunk of extra cmake code to implement such logic.

If you want to attempt to change this, I think it would be best if that was a separate commit/PR, so it is easier to bisect/revert.

@nmoinvaz
Copy link
Copy Markdown
Member

nmoinvaz commented Aug 30, 2023

This is the commit 0b8418e where the change was made to use check_c_source_runs:

CMakeLists.txt: use check_c_source_runs instead of check_c_source_compiles
to try to avoid using intrinsics and an instruction set the compiler
knows but the host CPU doesn't support.

It seems like they could have just use WITH_NATIVE_INSTRUCTIONS because that is the purpose. Original commit didn't state why WITH_NATIVE_INSTRUCTIONS is not a valid solution.

I am infavor, of having a separate commit that removes check_c_source_compile_or_run and just going back to using check_c_source_compiles.

Also AFAIK, configure script does not attempt to run anything, only compile.

Copy link
Copy Markdown
Member

@Dead2 Dead2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Dead2 Dead2 merged commit ca2d4e5 into zlib-ng:develop Sep 13, 2023
@Dead2 Dead2 mentioned this pull request Oct 13, 2023
@haampie haampie deleted the fix/avx-detection-with-old-assembler branch October 21, 2023 15:21
haampie added a commit to haampie/zlib-ng that referenced this pull request Jul 29, 2025
On RHEL9 the GCC is new enough to support AVX512-VNNI, but its assembler
(binutils) is not and errors with

```
Error: unsupported instruction vpdpbusd
```

This was already addressed earlier in
zlib-ng#1562 to some extent, except that
a check for `_mm256_dpbusd_epi32` was not added, which is what the
assembler errors over.
Dead2 pushed a commit that referenced this pull request Aug 3, 2025
On RHEL9 the GCC is new enough to support AVX512-VNNI, but its assembler
(binutils) is not and errors with

```
Error: unsupported instruction vpdpbusd
```

This was already addressed earlier in
#1562 to some extent, except that
a check for `_mm256_dpbusd_epi32` was not added, which is what the
assembler errors over.
Dead2 pushed a commit that referenced this pull request Aug 3, 2025
On RHEL9 the GCC is new enough to support AVX512-VNNI, but its assembler
(binutils) is not and errors with

```
Error: unsupported instruction vpdpbusd
```

This was already addressed earlier in
#1562 to some extent, except that
a check for `_mm256_dpbusd_epi32` was not added, which is what the
assembler errors over.
Dead2 pushed a commit that referenced this pull request Aug 7, 2025
On RHEL9 the GCC is new enough to support AVX512-VNNI, but its assembler
(binutils) is not and errors with

```
Error: unsupported instruction vpdpbusd
```

This was already addressed earlier in
#1562 to some extent, except that
a check for `_mm256_dpbusd_epi32` was not added, which is what the
assembler errors over.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

autotools configure script does not discern AVX512 hardware support

6 participants