Skip to content

Optimize string2ll with load-time CPU feature check using IFUNC resolver#2099

Merged
zuiderkwast merged 5 commits intovalkey-io:unstablefrom
zhulipeng:ifunc
May 23, 2025
Merged

Optimize string2ll with load-time CPU feature check using IFUNC resolver#2099
zuiderkwast merged 5 commits intovalkey-io:unstablefrom
zhulipeng:ifunc

Conversation

@zhulipeng
Copy link
Member

Problem Statement

As we increasingly incorporate SIMD instructions to optimize performance-critical functions, our current runtime CPU feature detection pattern is introducing significant overhead. For example, in the string2ll function, the feature detection wrapper adds approximately 3.35% performance overhead (shown as Fig 1), which is very significant especially considering the AVX-optimized implementation itself is highly effective.

Current pattern:

int string2ll(const char *s, size_t slen, long long *value) {
#if HAVE_X86_SIMD
    if (__builtin_cpu_supports("avx512f") &&
        __builtin_cpu_supports("avx512vl") &&
        __builtin_cpu_supports("avx512bw"))
        return string2llAVX512(s, slen, value);
#endif
    return string2llScalar(s, slen, value);
}

image

Fig 1. Profiling of memtier_benchmark-1key-list-10K-elements-lpos-integer

Proposed Solution

This PR implements the GNU IFUNC (indirect function) mechanism, which resolves the optimal function implementation at program load time rather than checking CPU features during each function call. This approach eliminates repeated detection overhead in hot code paths.

Benefits

  1. Performance Improvement: Eliminates feature detection overhead (~3.35% for string2ll)
  2. One-time Detection: CPU capabilities are checked once at load time
  3. Zero Runtime Overhead: No conditionals in the call path
  4. Maintainability: Cleaner code without conditional branches

Performance Boost

The example test suite we can observe a ~3% performance improvement.

Benchmark Performance Boost
memtier_benchmark-1key-list-10K-elements-lpos-integer 3%

Test Env

  • OS: CentOS Stream 9
  • Platform: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
  • Server and Client in same socket

Valkey-server configuration

taskset -c 0 ~/valkey/src/valkey-server ~/tmp_valkey.conf
port 9001
bind * -::*
daemonize no
protected-mode no
save

---------

Signed-off-by: Lipeng Zhu <[email protected]>
Co-authored-by: Guo Wangyang <[email protected]>
@codecov
Copy link

codecov bot commented May 19, 2025

Codecov Report

Attention: Patch coverage is 50.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 71.23%. Comparing base (8d686dd) to head (c303183).
Report is 23 commits behind head on unstable.

Files with missing lines Patch % Lines
src/util.c 50.00% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2099      +/-   ##
============================================
- Coverage     71.25%   71.23%   -0.02%     
============================================
  Files           122      122              
  Lines         66026    66046      +20     
============================================
+ Hits          47047    47051       +4     
- Misses        18979    18995      +16     
Files with missing lines Coverage Δ
src/util.c 65.56% <50.00%> (ø)

... and 19 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

---------

Signed-off-by: Lipeng Zhu <[email protected]>
Co-authored-by: Guo Wangyang <[email protected]>
Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks, also @r-devulap for review!

Copy link

@r-devulap r-devulap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment, feel free to ignore. Otherwise LGTM.

src/config.h Outdated

#if defined(__linux__) && \
((defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))) || \
(defined(__clang__) && (__clang_major__ > 3 || (__clang_major__ == 3 && __clang_minor__ >= 4))))
Copy link

@r-devulap r-devulap May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect valkey to be compiled with gcc4 and clang3 which are more than a decade old? If not, perhaps we can keep this simple: (clang also defines GNUC)

#if defined(__linux__) && (defined(__GNUC__) 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better safe. We can ignore 4.x to make the check simpler.

#if defined(__linux__) && defined(__GNUC__) && __GNUC__ > 4

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clang defines __GNUC__ as 4 regardless of the actual Clang version, so this check fails with clang.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we need GNUC > 4 or clang_major > 3 then :)

Btw, if a macro isn't defined, for expample __clang_major__, don't you get a warning for defined(__clang_major) && __clang_major__ > 3 when compiling with -Wundef?

Copy link

@r-devulap r-devulap May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we need GNUC > 4 or clang_major > 3 then :)

yeah. But they are so old though :)

Btw, if a macro isn't defined, for expample clang_major, don't you get a warning for defined(__clang_major) && clang_major > 3 when compiling with -Wundef?

Looks like, not if proceeded by #if defined. This seems to be build with gcc main.c -Wundef:

#include <stdio.h>
int main(void) {
#if defined(__clang_major) && __clang_major__ > 3
    printf("The value of __clang__major__ is: %d\n", __clang_major__);
#endif
    return 0;
}

This one throws a warning:

#include <stdio.h>
int main(void) {
#if __clang_major__ > 3
    printf("The value of __clang__major__ is: %d\n", __clang_major__);
#endif
    return 0;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Thanks, I learned something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, the __builtin_cpu_init() function was introduced in Clang with version 6.0, released in 2018. and in GCC with version 4.8, released in 2013.

Signed-off-by: Lipeng Zhu <[email protected]>
@zhulipeng
Copy link
Member Author

zhulipeng commented May 23, 2025

Seems the crash is caused by the ASan with these built-in cpu functions, particularly __builtin_cpu_init, might allocate data on the stack or call other functions that involve stack usage. When called before ASan's full initialization, these stack operations can trigger ASan errors.

A workaround described in google/sanitizers#342 is to add the__attribute__((no_sanitize_address)) to the resolver function.
Or back to the cpuid solution?

#if HAVE_IFUNC && HAVE_X86_SIMD
__attribute__((no_sanitize_address))
static int (*string2ll_resolver(void))(const char *, size_t, long long *) {
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f") &&
        __builtin_cpu_supports("avx512vl") &&
        __builtin_cpu_supports("avx512bw"))
        return string2llAVX512;
    return string2llScalar;
}

Reproduce:

make -j SANITIZER=address # build server with ASan
gdb --args ~/valkey/src/valkey-server ~/valkey.conf

Program received signal SIGSEGV, Segmentation fault.
0x0000000000528d5b in string2ll_resolver () at ~/valkey/src/util.c:630
630         if (__builtin_cpu_supports("avx512f") &&
(gdb) bt
#0  0x0000000000528d5b in string2ll_resolver () at ~/valkey/src/util.c:630
#1  0x00007ffff7fd5de0 in elf_machine_lazy_rel (skip_ifunc=0, reloc=0x450cf8, l_addr=<optimized out>, scope=0x7ffff7ffe580, map=0x7ffff7ffe210)
    at ../sysdeps/x86_64/dl-machine.h:591
#2  elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=<optimized out>, nrelative=<optimized out>, relsize=<optimized out>, reladdr=<optimized out>,
    scope=<optimized out>, map=0x7ffff7ffe210) at /usr/src/debug/glibc-2.34-180.el9.x86_64/elf/do-rel.h:79
#3  _dl_relocate_object (l=l@entry=0x7ffff7ffe210, scope=<optimized out>, reloc_mode=<optimized out>, consider_profiling=<optimized out>, consider_profiling@entry=0)
    at dl-reloc.c:288
#4  0x00007ffff7fe8089 in dl_main (phdr=<optimized out>, phnum=<optimized out>, user_entry=<optimized out>, auxv=<optimized out>) at rtld.c:2461
#5  0x00007ffff7fe3de3 in _dl_sysdep_start (start_argptr=start_argptr@entry=0x7fffffffde70, dl_main=dl_main@entry=0x7ffff7fe5c60 <dl_main>)
    at ../sysdeps/unix/sysv/linux/dl-sysdep.c:140
#6  0x00007ffff7fe5738 in _dl_start_final (arg=0x7fffffffde70) at rtld.c:502
#7  _dl_start (arg=0x7fffffffde70) at rtld.c:587
#8  0x00007ffff7fe4808 in _start () from /lib64/ld-linux-x86-64.so.2

@zhulipeng zhulipeng force-pushed the ifunc branch 2 times, most recently from ee3603a to 7da368e Compare May 23, 2025 04:29
---------

Signed-off-by: Lipeng Zhu <[email protected]>
Co-authored-by: Guo Wangyang <[email protected]>
Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea to disable ASan.

Good catch about clang 6 for cpu init. It's OK, it's already very old (2018).

Or do you actually like the cpuid solution more? You can try to convince me. 😆 It's more code but less dependencies...

@zhulipeng
Copy link
Member Author

Or do you actually like the cpuid solution more? You can try to convince me. 😆 It's more code but less dependencies

IMO, cpuid is simpler. Using the cpuid instruction directly provides more immediate access and full control over the detection logic compared to the __builtin_cpu_init and __builtin_cpu_supports functions.

While I understand your preference for the __builtin approach 😄, there's a clear trade-off here. The __builtin functions nicely abstract away low-level implementation details, resulting in cleaner, more portable code. However, this abstraction layer may occasionally introduce unexpected behavior due to internal optimizations or compiler-specific implementations, such as conflicts with ASan.

Co-authored-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
@zuiderkwast
Copy link
Contributor

While I understand your preference for the __builtin approach 😄, there's a clear trade-off here. The __builtin functions nicely abstract away low-level implementation details, resulting in cleaner, more portable code. However, this abstraction layer may occasionally introduce unexpected behavior due to internal optimizations or compiler-specific implementations, such as conflicts with ASan.

Yes, it's a tradeoff.

But we use __builtin in the fallback code that runs without ifunc. It's good that these two checks for AVX use the same style (with and without ifunc). Do you want to use cpuid here too?

We have solved ASan and now. Do you worry about another problem, like another sanitizer or valgrind?

@zuiderkwast zuiderkwast changed the title CPU Feature Detection Optimization Using IFUNC Optimize string2ll with load-time CPU feature check using IFUNC resolver May 23, 2025
@zuiderkwast zuiderkwast merged commit 4092c9c into valkey-io:unstable May 23, 2025
50 of 51 checks passed
@zuiderkwast zuiderkwast added the release-notes This issue should get a line item in the release notes label May 23, 2025
@zhulipeng
Copy link
Member Author

@zuiderkwast @r-devulap Thank you both for your efforts and suggestions on this matter.
@r-devulap Glad to see you in the Valkey community! 🙂

@roshkhatri
Copy link
Member

This commit would break alpine build for valkey as Alpine Linux uses musl-libc instead of GLIBC, and musl-libc does not implement the IFUNC mechanism. just saw the failure in a PR for valkey-conainter as mentioned by alaviss here: valkey-io/valkey-container#70

shanwan1 pushed a commit to shanwan1/valkey that referenced this pull request Jun 13, 2025
…ver (valkey-io#2099)

As we increasingly incorporate SIMD instructions to optimize
performance-critical functions, our current runtime CPU feature
detection pattern is introducing significant overhead. For example, in
the **string2ll** function, the feature detection wrapper adds
approximately 3.35% performance overhead (shown as Fig 1), which is very
significant especially considering the AVX-optimized implementation
itself is highly effective.

Current pattern:
```c
int string2ll(const char *s, size_t slen, long long *value) {
    if (__builtin_cpu_supports("avx512f") &&
        __builtin_cpu_supports("avx512vl") &&
        __builtin_cpu_supports("avx512bw"))
        return string2llAVX512(s, slen, value);
    return string2llScalar(s, slen, value);
}
```

![image](https://github.com/user-attachments/assets/16508faa-cc7d-4437-9de3-09c58522026e)

Fig 1. Profiling of
[memtier_benchmark-1key-list-10K-elements-lpos-integer](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1key-list-10K-elements-lpos-integer.yml)

This PR implements the **GNU IFUNC** (indirect function) mechanism,
which resolves the optimal function implementation at program load time
rather than checking CPU features during each function call. This
approach eliminates repeated detection overhead in hot code paths.

1. **Performance Improvement**: Eliminates feature detection overhead
(~3.35% for string2ll)
2. **One-time Detection**: CPU capabilities are checked once at load
time
3. **Zero Runtime Overhead**: No conditionals in the call path
4.  **Maintainability**: Cleaner code without conditional branches

The example test suite we can observe a ~3% performance improvement.
|Benchmark|Performance Boost|
|-|-|

|[memtier_benchmark-1key-list-10K-elements-lpos-integer](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1key-list-10K-elements-lpos-integer.yml)|**3%**|

- OS: CentOS Stream 9
- Platform: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
- Server and Client in same socket

```
taskset -c 0 ~/valkey/src/valkey-server ~/tmp_valkey.conf
port 9001
bind * -::*
daemonize no
protected-mode no
save
```

---------

Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Co-authored-by: Guo Wangyang <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance release-notes This issue should get a line item in the release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants