Optimize string2ll with load-time CPU feature check using IFUNC resolver#2099
Optimize string2ll with load-time CPU feature check using IFUNC resolver#2099zuiderkwast merged 5 commits intovalkey-io:unstablefrom
Conversation
--------- Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #2099 +/- ##
============================================
- Coverage 71.25% 71.23% -0.02%
============================================
Files 122 122
Lines 66026 66046 +20
============================================
+ Hits 47047 47051 +4
- Misses 18979 18995 +16
🚀 New features to boost your workflow:
|
--------- Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]>
zuiderkwast
left a comment
There was a problem hiding this comment.
LGTM, thanks, also @r-devulap for review!
r-devulap
left a comment
There was a problem hiding this comment.
Minor comment, feel free to ignore. Otherwise LGTM.
src/config.h
Outdated
|
|
||
| #if defined(__linux__) && \ | ||
| ((defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))) || \ | ||
| (defined(__clang__) && (__clang_major__ > 3 || (__clang_major__ == 3 && __clang_minor__ >= 4)))) |
There was a problem hiding this comment.
Do we expect valkey to be compiled with gcc4 and clang3 which are more than a decade old? If not, perhaps we can keep this simple: (clang also defines GNUC)
#if defined(__linux__) && (defined(__GNUC__)
There was a problem hiding this comment.
Better safe. We can ignore 4.x to make the check simpler.
#if defined(__linux__) && defined(__GNUC__) && __GNUC__ > 4
There was a problem hiding this comment.
Clang defines __GNUC__ as 4 regardless of the actual Clang version, so this check fails with clang.
There was a problem hiding this comment.
So we need GNUC > 4 or clang_major > 3 then :)
Btw, if a macro isn't defined, for expample __clang_major__, don't you get a warning for defined(__clang_major) && __clang_major__ > 3 when compiling with -Wundef?
There was a problem hiding this comment.
So we need GNUC > 4 or clang_major > 3 then :)
yeah. But they are so old though :)
Btw, if a macro isn't defined, for expample clang_major, don't you get a warning for defined(__clang_major) && clang_major > 3 when compiling with -Wundef?
Looks like, not if proceeded by #if defined. This seems to be build with gcc main.c -Wundef:
#include <stdio.h>
int main(void) {
#if defined(__clang_major) && __clang_major__ > 3
printf("The value of __clang__major__ is: %d\n", __clang_major__);
#endif
return 0;
}
This one throws a warning:
#include <stdio.h>
int main(void) {
#if __clang_major__ > 3
printf("The value of __clang__major__ is: %d\n", __clang_major__);
#endif
return 0;
}
There was a problem hiding this comment.
Interesting. Thanks, I learned something.
There was a problem hiding this comment.
FYI, the __builtin_cpu_init() function was introduced in Clang with version 6.0, released in 2018. and in GCC with version 4.8, released in 2013.
Signed-off-by: Lipeng Zhu <[email protected]>
|
Seems the crash is caused by the ASan with these built-in cpu functions, particularly A workaround described in google/sanitizers#342 is to add the #if HAVE_IFUNC && HAVE_X86_SIMD
__attribute__((no_sanitize_address))
static int (*string2ll_resolver(void))(const char *, size_t, long long *) {
__builtin_cpu_init();
if (__builtin_cpu_supports("avx512f") &&
__builtin_cpu_supports("avx512vl") &&
__builtin_cpu_supports("avx512bw"))
return string2llAVX512;
return string2llScalar;
}Reproduce: make -j SANITIZER=address # build server with ASangdb --args ~/valkey/src/valkey-server ~/valkey.conf
Program received signal SIGSEGV, Segmentation fault.
0x0000000000528d5b in string2ll_resolver () at ~/valkey/src/util.c:630
630 if (__builtin_cpu_supports("avx512f") &&
(gdb) bt
#0 0x0000000000528d5b in string2ll_resolver () at ~/valkey/src/util.c:630
#1 0x00007ffff7fd5de0 in elf_machine_lazy_rel (skip_ifunc=0, reloc=0x450cf8, l_addr=<optimized out>, scope=0x7ffff7ffe580, map=0x7ffff7ffe210)
at ../sysdeps/x86_64/dl-machine.h:591
#2 elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=<optimized out>, nrelative=<optimized out>, relsize=<optimized out>, reladdr=<optimized out>,
scope=<optimized out>, map=0x7ffff7ffe210) at /usr/src/debug/glibc-2.34-180.el9.x86_64/elf/do-rel.h:79
#3 _dl_relocate_object (l=l@entry=0x7ffff7ffe210, scope=<optimized out>, reloc_mode=<optimized out>, consider_profiling=<optimized out>, consider_profiling@entry=0)
at dl-reloc.c:288
#4 0x00007ffff7fe8089 in dl_main (phdr=<optimized out>, phnum=<optimized out>, user_entry=<optimized out>, auxv=<optimized out>) at rtld.c:2461
#5 0x00007ffff7fe3de3 in _dl_sysdep_start (start_argptr=start_argptr@entry=0x7fffffffde70, dl_main=dl_main@entry=0x7ffff7fe5c60 <dl_main>)
at ../sysdeps/unix/sysv/linux/dl-sysdep.c:140
#6 0x00007ffff7fe5738 in _dl_start_final (arg=0x7fffffffde70) at rtld.c:502
#7 _dl_start (arg=0x7fffffffde70) at rtld.c:587
#8 0x00007ffff7fe4808 in _start () from /lib64/ld-linux-x86-64.so.2
|
ee3603a to
7da368e
Compare
--------- Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]>
zuiderkwast
left a comment
There was a problem hiding this comment.
Good idea to disable ASan.
Good catch about clang 6 for cpu init. It's OK, it's already very old (2018).
Or do you actually like the cpuid solution more? You can try to convince me. 😆 It's more code but less dependencies...
IMO, While I understand your preference for the __builtin approach 😄, there's a clear trade-off here. The __builtin functions nicely abstract away low-level implementation details, resulting in cleaner, more portable code. However, this abstraction layer may occasionally introduce unexpected behavior due to internal optimizations or compiler-specific implementations, such as conflicts with ASan. |
Co-authored-by: Viktor Söderqvist <[email protected]> Signed-off-by: Lipeng Zhu <[email protected]>
Yes, it's a tradeoff. But we use __builtin in the fallback code that runs without ifunc. It's good that these two checks for AVX use the same style (with and without ifunc). Do you want to use cpuid here too? We have solved ASan and now. Do you worry about another problem, like another sanitizer or valgrind? |
|
@zuiderkwast @r-devulap Thank you both for your efforts and suggestions on this matter. |
|
This commit would break alpine build for valkey as Alpine Linux uses |
Follow up of #2099 Fix https://github.com/valkey-io/valkey/actions/runs/15221314860/job/42817270745, https://github.com/valkey-io/valkey/actions/runs/15221314860/job/42817271735 --------- Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]>
Follow up of valkey-io/valkey#2099 Fix https://github.com/valkey-io/valkey/actions/runs/15221314860/job/42817270745, https://github.com/valkey-io/valkey/actions/runs/15221314860/job/42817271735 --------- Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]>
…ver (valkey-io#2099) As we increasingly incorporate SIMD instructions to optimize performance-critical functions, our current runtime CPU feature detection pattern is introducing significant overhead. For example, in the **string2ll** function, the feature detection wrapper adds approximately 3.35% performance overhead (shown as Fig 1), which is very significant especially considering the AVX-optimized implementation itself is highly effective. Current pattern: ```c int string2ll(const char *s, size_t slen, long long *value) { if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512vl") && __builtin_cpu_supports("avx512bw")) return string2llAVX512(s, slen, value); return string2llScalar(s, slen, value); } ```  Fig 1. Profiling of [memtier_benchmark-1key-list-10K-elements-lpos-integer](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1key-list-10K-elements-lpos-integer.yml) This PR implements the **GNU IFUNC** (indirect function) mechanism, which resolves the optimal function implementation at program load time rather than checking CPU features during each function call. This approach eliminates repeated detection overhead in hot code paths. 1. **Performance Improvement**: Eliminates feature detection overhead (~3.35% for string2ll) 2. **One-time Detection**: CPU capabilities are checked once at load time 3. **Zero Runtime Overhead**: No conditionals in the call path 4. **Maintainability**: Cleaner code without conditional branches The example test suite we can observe a ~3% performance improvement. |Benchmark|Performance Boost| |-|-| |[memtier_benchmark-1key-list-10K-elements-lpos-integer](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1key-list-10K-elements-lpos-integer.yml)|**3%**| - OS: CentOS Stream 9 - Platform: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz - Server and Client in same socket ``` taskset -c 0 ~/valkey/src/valkey-server ~/tmp_valkey.conf port 9001 bind * -::* daemonize no protected-mode no save ``` --------- Signed-off-by: Lipeng Zhu <[email protected]> Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]>
…2133) Follow up of valkey-io#2099 Fix https://github.com/valkey-io/valkey/actions/runs/15221314860/job/42817270745, https://github.com/valkey-io/valkey/actions/runs/15221314860/job/42817271735 --------- Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]> Signed-off-by: shanwan1 <[email protected]>
Problem Statement
As we increasingly incorporate SIMD instructions to optimize performance-critical functions, our current runtime CPU feature detection pattern is introducing significant overhead. For example, in the string2ll function, the feature detection wrapper adds approximately 3.35% performance overhead (shown as Fig 1), which is very significant especially considering the AVX-optimized implementation itself is highly effective.
Current pattern:
Fig 1. Profiling of memtier_benchmark-1key-list-10K-elements-lpos-integer
Proposed Solution
This PR implements the GNU IFUNC (indirect function) mechanism, which resolves the optimal function implementation at program load time rather than checking CPU features during each function call. This approach eliminates repeated detection overhead in hot code paths.
Benefits
Performance Boost
The example test suite we can observe a ~3% performance improvement.
Test Env
Valkey-server configuration