Optimize string2ll with load-time CPU feature check using IFUNC resolver by zhulipeng · Pull Request #2099 · valkey-io/valkey

zhulipeng · 2025-05-19T03:08:07Z

Problem Statement

As we increasingly incorporate SIMD instructions to optimize performance-critical functions, our current runtime CPU feature detection pattern is introducing significant overhead. For example, in the string2ll function, the feature detection wrapper adds approximately 3.35% performance overhead (shown as Fig 1), which is very significant especially considering the AVX-optimized implementation itself is highly effective.

Current pattern:

int string2ll(const char *s, size_t slen, long long *value) {
#if HAVE_X86_SIMD
    if (__builtin_cpu_supports("avx512f") &&
        __builtin_cpu_supports("avx512vl") &&
        __builtin_cpu_supports("avx512bw"))
        return string2llAVX512(s, slen, value);
#endif
    return string2llScalar(s, slen, value);
}

Fig 1. Profiling of memtier_benchmark-1key-list-10K-elements-lpos-integer

Proposed Solution

This PR implements the GNU IFUNC (indirect function) mechanism, which resolves the optimal function implementation at program load time rather than checking CPU features during each function call. This approach eliminates repeated detection overhead in hot code paths.

Benefits

Performance Improvement: Eliminates feature detection overhead (~3.35% for string2ll)
One-time Detection: CPU capabilities are checked once at load time
Zero Runtime Overhead: No conditionals in the call path
Maintainability: Cleaner code without conditional branches

Performance Boost

The example test suite we can observe a ~3% performance improvement.

Benchmark	Performance Boost
memtier_benchmark-1key-list-10K-elements-lpos-integer	3%

Test Env

OS: CentOS Stream 9
Platform: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
Server and Client in same socket

Valkey-server configuration

taskset -c 0 ~/valkey/src/valkey-server ~/tmp_valkey.conf
port 9001
bind * -::*
daemonize no
protected-mode no
save

--------- Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]>

codecov · 2025-05-19T03:23:15Z

Codecov Report

Attention: Patch coverage is 50.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 71.23%. Comparing base (8d686dd) to head (c303183).
Report is 23 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/util.c	50.00%	3 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2099      +/-   ##
============================================
- Coverage     71.25%   71.23%   -0.02%     
============================================
  Files           122      122              
  Lines         66026    66046      +20     
============================================
+ Hits          47047    47051       +4     
- Misses        18979    18995      +16

Files with missing lines	Coverage Δ
src/util.c	`65.56% <50.00%> (ø)`

... and 19 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

src/util.c

--------- Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]>

zuiderkwast

LGTM, thanks, also @r-devulap for review!

r-devulap

Minor comment, feel free to ignore. Otherwise LGTM.

r-devulap · 2025-05-22T04:04:43Z

src/config.h


+#if defined(__linux__) &&                                                               \
+    ((defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))) || \
+     (defined(__clang__) && (__clang_major__ > 3 || (__clang_major__ == 3 && __clang_minor__ >= 4))))


Do we expect valkey to be compiled with gcc4 and clang3 which are more than a decade old? If not, perhaps we can keep this simple: (clang also defines GNUC)

#if defined(__linux__) && (defined(__GNUC__)

Better safe. We can ignore 4.x to make the check simpler.

#if defined(__linux__) && defined(__GNUC__) && __GNUC__ > 4

Clang defines __GNUC__ as 4 regardless of the actual Clang version, so this check fails with clang.

So we need GNUC > 4 or clang_major > 3 then :)

Btw, if a macro isn't defined, for expample __clang_major__, don't you get a warning for defined(__clang_major) && __clang_major__ > 3 when compiling with -Wundef?

So we need GNUC > 4 or clang_major > 3 then :)

yeah. But they are so old though :)

Btw, if a macro isn't defined, for expample clang_major, don't you get a warning for defined(__clang_major) && clang_major > 3 when compiling with -Wundef?

Looks like, not if proceeded by #if defined. This seems to be build with gcc main.c -Wundef:

#include <stdio.h> int main(void) { #if defined(__clang_major) && __clang_major__ > 3 printf("The value of __clang__major__ is: %d\n", __clang_major__); #endif return 0; }

This one throws a warning:

#include <stdio.h> int main(void) { #if __clang_major__ > 3 printf("The value of __clang__major__ is: %d\n", __clang_major__); #endif return 0; }

Interesting. Thanks, I learned something.

FYI, the __builtin_cpu_init() function was introduced in Clang with version 6.0, released in 2018. and in GCC with version 4.8, released in 2013.

Signed-off-by: Lipeng Zhu <[email protected]>

zhulipeng · 2025-05-23T02:29:48Z

Seems the crash is caused by the ASan with these built-in cpu functions, particularly __builtin_cpu_init, might allocate data on the stack or call other functions that involve stack usage. When called before ASan's full initialization, these stack operations can trigger ASan errors.

A workaround described in google/sanitizers#342 is to add the__attribute__((no_sanitize_address)) to the resolver function.
Or back to the cpuid solution?

#if HAVE_IFUNC && HAVE_X86_SIMD
__attribute__((no_sanitize_address))
static int (*string2ll_resolver(void))(const char *, size_t, long long *) {
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f") &&
        __builtin_cpu_supports("avx512vl") &&
        __builtin_cpu_supports("avx512bw"))
        return string2llAVX512;
    return string2llScalar;
}

Reproduce:

make -j SANITIZER=address # build server with ASan

gdb --args ~/valkey/src/valkey-server ~/valkey.conf

Program received signal SIGSEGV, Segmentation fault.
0x0000000000528d5b in string2ll_resolver () at ~/valkey/src/util.c:630
630         if (__builtin_cpu_supports("avx512f") &&
(gdb) bt
#0  0x0000000000528d5b in string2ll_resolver () at ~/valkey/src/util.c:630
#1  0x00007ffff7fd5de0 in elf_machine_lazy_rel (skip_ifunc=0, reloc=0x450cf8, l_addr=<optimized out>, scope=0x7ffff7ffe580, map=0x7ffff7ffe210)
    at ../sysdeps/x86_64/dl-machine.h:591
#2  elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=<optimized out>, nrelative=<optimized out>, relsize=<optimized out>, reladdr=<optimized out>,
    scope=<optimized out>, map=0x7ffff7ffe210) at /usr/src/debug/glibc-2.34-180.el9.x86_64/elf/do-rel.h:79
#3  _dl_relocate_object (l=l@entry=0x7ffff7ffe210, scope=<optimized out>, reloc_mode=<optimized out>, consider_profiling=<optimized out>, consider_profiling@entry=0)
    at dl-reloc.c:288
#4  0x00007ffff7fe8089 in dl_main (phdr=<optimized out>, phnum=<optimized out>, user_entry=<optimized out>, auxv=<optimized out>) at rtld.c:2461
#5  0x00007ffff7fe3de3 in _dl_sysdep_start (start_argptr=start_argptr@entry=0x7fffffffde70, dl_main=dl_main@entry=0x7ffff7fe5c60 <dl_main>)
    at ../sysdeps/unix/sysv/linux/dl-sysdep.c:140
#6  0x00007ffff7fe5738 in _dl_start_final (arg=0x7fffffffde70) at rtld.c:502
#7  _dl_start (arg=0x7fffffffde70) at rtld.c:587
#8  0x00007ffff7fe4808 in _start () from /lib64/ld-linux-x86-64.so.2

--------- Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]>

zuiderkwast

Good idea to disable ASan.

Good catch about clang 6 for cpu init. It's OK, it's already very old (2018).

Or do you actually like the cpuid solution more? You can try to convince me. 😆 It's more code but less dependencies...

src/util.c

zhulipeng · 2025-05-23T08:30:53Z

Or do you actually like the cpuid solution more? You can try to convince me. 😆 It's more code but less dependencies

IMO, cpuid is simpler. Using the cpuid instruction directly provides more immediate access and full control over the detection logic compared to the __builtin_cpu_init and __builtin_cpu_supports functions.

While I understand your preference for the __builtin approach 😄, there's a clear trade-off here. The __builtin functions nicely abstract away low-level implementation details, resulting in cleaner, more portable code. However, this abstraction layer may occasionally introduce unexpected behavior due to internal optimizations or compiler-specific implementations, such as conflicts with ASan.

Co-authored-by: Viktor Söderqvist <[email protected]> Signed-off-by: Lipeng Zhu <[email protected]>

zuiderkwast · 2025-05-23T08:43:43Z

While I understand your preference for the __builtin approach 😄, there's a clear trade-off here. The __builtin functions nicely abstract away low-level implementation details, resulting in cleaner, more portable code. However, this abstraction layer may occasionally introduce unexpected behavior due to internal optimizations or compiler-specific implementations, such as conflicts with ASan.

Yes, it's a tradeoff.

But we use __builtin in the fallback code that runs without ifunc. It's good that these two checks for AVX use the same style (with and without ifunc). Do you want to use cpuid here too?

We have solved ASan and now. Do you worry about another problem, like another sanitizer or valgrind?

zhulipeng · 2025-05-23T09:52:00Z

@zuiderkwast @r-devulap Thank you both for your efforts and suggestions on this matter.
@r-devulap Glad to see you in the Valkey community! 🙂

roshkhatri · 2025-05-23T20:33:02Z

This commit would break alpine build for valkey as Alpine Linux uses musl-libc instead of GLIBC, and musl-libc does not implement the IFUNC mechanism. just saw the failure in a PR for valkey-conainter as mentioned by alaviss here: valkey-io/valkey-container#70

Follow up of #2099 Fix https://github.com/valkey-io/valkey/actions/runs/15221314860/job/42817270745, https://github.com/valkey-io/valkey/actions/runs/15221314860/job/42817271735 --------- Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]>

Follow up of valkey-io/valkey#2099 Fix https://github.com/valkey-io/valkey/actions/runs/15221314860/job/42817270745, https://github.com/valkey-io/valkey/actions/runs/15221314860/job/42817271735 --------- Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]>

…ver (valkey-io#2099) As we increasingly incorporate SIMD instructions to optimize performance-critical functions, our current runtime CPU feature detection pattern is introducing significant overhead. For example, in the **string2ll** function, the feature detection wrapper adds approximately 3.35% performance overhead (shown as Fig 1), which is very significant especially considering the AVX-optimized implementation itself is highly effective. Current pattern: ```c int string2ll(const char *s, size_t slen, long long *value) { if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512vl") && __builtin_cpu_supports("avx512bw")) return string2llAVX512(s, slen, value); return string2llScalar(s, slen, value); } ``` ![image](https://github.com/user-attachments/assets/16508faa-cc7d-4437-9de3-09c58522026e) Fig 1. Profiling of [memtier_benchmark-1key-list-10K-elements-lpos-integer](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1key-list-10K-elements-lpos-integer.yml) This PR implements the **GNU IFUNC** (indirect function) mechanism, which resolves the optimal function implementation at program load time rather than checking CPU features during each function call. This approach eliminates repeated detection overhead in hot code paths. 1. **Performance Improvement**: Eliminates feature detection overhead (~3.35% for string2ll) 2. **One-time Detection**: CPU capabilities are checked once at load time 3. **Zero Runtime Overhead**: No conditionals in the call path 4. **Maintainability**: Cleaner code without conditional branches The example test suite we can observe a ~3% performance improvement. |Benchmark|Performance Boost| |-|-| |[memtier_benchmark-1key-list-10K-elements-lpos-integer](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1key-list-10K-elements-lpos-integer.yml)|**3%**| - OS: CentOS Stream 9 - Platform: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz - Server and Client in same socket ``` taskset -c 0 ~/valkey/src/valkey-server ~/tmp_valkey.conf port 9001 bind * -::* daemonize no protected-mode no save ``` --------- Signed-off-by: Lipeng Zhu <[email protected]> Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]>

…2133) Follow up of valkey-io#2099 Fix https://github.com/valkey-io/valkey/actions/runs/15221314860/job/42817270745, https://github.com/valkey-io/valkey/actions/runs/15221314860/job/42817271735 --------- Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]> Signed-off-by: shanwan1 <[email protected]>

CPU Feature Detection Optimization Using IFUNC

67cab34

--------- Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]>

r-devulap reviewed May 20, 2025

View reviewed changes

src/util.c Outdated Show resolved Hide resolved

src/util.c Outdated Show resolved Hide resolved

r-devulap reviewed May 20, 2025

View reviewed changes

src/util.c Outdated Show resolved Hide resolved

zhulipeng force-pushed the ifunc branch from 464c044 to 26ff06d Compare May 22, 2025 01:28

address review comments

26ff06d

--------- Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]>

zuiderkwast approved these changes May 22, 2025

View reviewed changes

r-devulap approved these changes May 22, 2025

View reviewed changes

address comments

cb5ac09

Signed-off-by: Lipeng Zhu <[email protected]>

zhulipeng force-pushed the ifunc branch 2 times, most recently from ee3603a to 7da368e Compare May 23, 2025 04:29

Fix test error.

7da368e

--------- Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Guo Wangyang <[email protected]>

zuiderkwast approved these changes May 23, 2025

View reviewed changes

src/util.c Show resolved Hide resolved

Update src/util.c

c303183

Co-authored-by: Viktor Söderqvist <[email protected]> Signed-off-by: Lipeng Zhu <[email protected]>

zuiderkwast changed the title ~~CPU Feature Detection Optimization Using IFUNC~~ Optimize string2ll with load-time CPU feature check using IFUNC resolver May 23, 2025

zuiderkwast merged commit 4092c9c into valkey-io:unstable May 23, 2025
50 of 51 checks passed

zuiderkwast added the release-notes This issue should get a line item in the release notes label May 23, 2025

zhulipeng deleted the ifunc branch May 23, 2025 09:56

alaviss mentioned this pull request May 23, 2025

Switch to multi-stage build valkey-io/valkey-container#70

Merged

zhulipeng mentioned this pull request May 24, 2025

Fix the GNU IFUNC error in alpine (non-glibc) environment #2133

Merged

zhulipeng added the performance label May 26, 2025

paulorsousa mentioned this pull request Jun 4, 2025

Optimize string2ll performance using AVX512 redis/redis#14101

Closed

zuiderkwast mentioned this pull request Sep 2, 2025

Don't use AVX2 instructions if the CPU don't support it #2571

Merged

Conversation

zhulipeng commented May 19, 2025

Problem Statement

Proposed Solution

Benefits

Performance Boost

Test Env

Valkey-server configuration

Uh oh!

codecov bot commented May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zuiderkwast left a comment

Choose a reason for hiding this comment

Uh oh!

r-devulap left a comment

Choose a reason for hiding this comment

Uh oh!

r-devulap May 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zuiderkwast May 22, 2025

Choose a reason for hiding this comment

Uh oh!

r-devulap May 22, 2025

Choose a reason for hiding this comment

Uh oh!

zuiderkwast May 22, 2025

Choose a reason for hiding this comment

Uh oh!

r-devulap May 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zuiderkwast May 22, 2025

Choose a reason for hiding this comment

Uh oh!

zhulipeng May 23, 2025

Choose a reason for hiding this comment

Uh oh!

zhulipeng commented May 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zuiderkwast left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

zhulipeng commented May 23, 2025

Uh oh!

zuiderkwast commented May 23, 2025

Uh oh!

Uh oh!

zhulipeng commented May 23, 2025

Uh oh!

roshkhatri commented May 23, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov bot commented May 19, 2025 •

edited

Loading

r-devulap May 22, 2025 •

edited

Loading

r-devulap May 22, 2025 •

edited

Loading

zhulipeng commented May 23, 2025 •

edited

Loading