Client argv object pool#13726
Conversation
|
Hi @oranagra what do you think this design, as we know, Valkey offloads freeing argv objects to io threads valkey-io/valkey#763, it needs to deliver jobs to io threads, for redis, io threads use event driven instead of busy loop, it brings unnecessary communication, and it only brings benefit for multi io threads. So we introduce argv object pools to avoid allocing and freeing argv objects, both single thread and multi io threads have improvement. maybe after batch parsing commands in pipeline mode, the io thread also can its own argv object pools. |
f5721be to
0287148
Compare
|
putting aside io-threads for a moment, considering jemalloc has thread-cache to keep recently freed allocations fast to re-use, i wonder why our allocations are so costly. maybe it's something we do (like memory accounting)? |
|
Hi @oranagra the flame graph as bellow, memory accounting (je_malloc_usable_size) did take some time. I am not clear about the internal mechanism of jemalloc. Between argv objects parsing and release, we will also apply and release a lot of memory during command processing, not sure if this will bring impact. |
|
i imagine that when allocating or releasing memory, jemalloc knows what's it's serving and can give that info for free. |
|
Hi @oranagra it seems jemalloc remove this feature from 4.0
je_mallocx(size_t size, int flags) can't support to return the size |
|
well, i don't like to keep caching things, and i'd prefer to keep it as last resort. maybe someone can spend a day or two of hacking trying to add an allocation and release APIs to our jemalloc that provide that info. |
|
There are two points to consider
|
|
so you're saying this could give an improvement of only 1.66%, while the current content of the PR gives 10% improvement? maybe we don't understand some implications of what we measure. my preference was to try and find something generic that can maybe improve all allocations, rather than just argument parsing, and also avoid adding this complexity. but if the specific value here is high, and we can't find something generic, i suppose i'll be ok with it. |
|
Hi @oranagra here are some updates
from flame graph, it is.
i did, the result is strange, the performance dropped by nearly half.
yes, i think so, i ever tried when i asked to look at valkey PR: valkey-io/valkey#453 , but i can't find easy way to do that, je_malloc/free have several branches, it is hard for me to safely change this part of the code. and there seems to be no good way to update memory usage through the hook function. I added a test and modified the code to avoid calling --- a/src/zmalloc.c
+++ b/src/zmalloc.c
@@ -122,9 +122,11 @@ static inline void *ztrymalloc_usable_internal(size_t size, size_t *usable) {
if (!ptr) return NULL;
#ifdef HAVE_MALLOC_SIZE
- size = zmalloc_size(ptr);
- update_zmalloc_stat_alloc(size);
- if (usable) *usable = size;
+ if (usable) {
+ size = zmalloc_size(ptr);
+ update_zmalloc_stat_alloc(size);
+ *usable = size;
+ }
return ptr;
#else
size = MALLOC_MIN_SIZE(size);
@@ -423,7 +425,7 @@ void zfree(void *ptr) {
if (ptr == NULL) return;
#ifdef HAVE_MALLOC_SIZE
- update_zmalloc_stat_free(zmalloc_size(ptr));
+ //update_zmalloc_stat_free(zmalloc_size(ptr));
free(ptr);
#else
realptr = (char*)ptr-PREFIX_SIZE;benchmark command unstable unstable without updating memory usage argv pool and i got the flame graph of unstable without updating memory usage, as bellow, seems |
that's odd, i see je_malloc_usable_size in the flame graph, but this flag is meant to avoid using it.
so the improvement in this PR is not just because we don't call zmalloc_size, but also because our caching is considerably more efficient for our case than relying on jemalloc tcache? bottom line, i'm nearly ready to give up, but before, i'd still rather understand why NO_MALLOC_USABLE_SIZE caused a regression compared do your temporary code modification, and why it still uses je_mallc_usable_size |
|
Hi @oranagra for compiling with CFLAGS=-DNO_MALLOC_USABLE_SIZE, there may be an error, but i don't know why, maybe the VM i bought has some errors, then I bought two other VMs(intel and amd), there is no performance degradation with CFLAGS=-DNO_MALLOC_USABLE_SIZE. Here are test results by benchmark commands as above P.S. From flame graph, creating argv objects and free argv objects costs much, that is the motivation i want to try. since reusing argv can improve performance, i think resuing argv objects may also can. For argv parsing, we may want zero-copy, as @yossigo suggested in io-thread internal doc, i tried a demo stringview, but it didn't bring improvement, i think argv generally is small in redis query and memcpy is well optimized. So i think avoid create/free argv objects may be right way. for jemalloc tcache implementation, i am not familiar with it. for our implementation of argv object pool, it just has operations on an array, is very simple. |
|
i'm not certain i was able to follow everything. |
yes, the regression is gone
Basically no difference, the benefit is very small, I've run several times, the benefit usually is less than 1%. |
|
Sorry. I only now bothered to open the linked valkey pr and realize why the jemalloc tcache isn't serving us here. By using our own pool, we bypass that. |
for single thread, this PR brings about 10% improvement for hset pipeline mode, 2.7% for set,get pipeline mode as above, all tests i shared with your are based on single thread. Even in multi-threaded mode, we only allocate and release the object of the first command of each client as you said way. Other commands in the pipeline are also allocated and released from the main thread. for multi threads, this PR is compatible, but not well optimized, as i said, maybe after batch parsing commands in pipeline mode, the io thread also can its own argv object pools. |
|
i'm sorry i'm a little bit distracted. AFAIU they did that change so that the jemalloc tcache is doing what it was designed to do. and what we're doing is to replace it with our own mechanism. but what you mentioned now about the benefit when using a single thread, means that for some reason it's more efficient than the tcache. i'm a bit confused. FYI, i'm currently working on batch parsing of commands ahead of their execution. |
For this mode, this PR also can bring improvement, since now just the first command in pipeline is parsed by IO thread, the reset commands are parsed by main thread.
AFAIK @sundb also is working on this. |
i must have meant multi-threaded without pipeline. in any case, i must say i don't like this direction too much, what val-key did, they had to do because io-threads work against the tcache, and i think we'll have to do that too, at least for the cases in which this PR doesn't kick in (larger sizes) if we'd want to consider this extra optimization for certain cases, we can, but at this point i don't like it because:
|
no, this pr doesn't. since the batch parsing of commands is on the way, these related code may be changed. Initially, we want not to free argv objects after executing if the client is from io thread, we just only add a flag, and then the client comes back to io thread, we finally free argv objects. and after finishing batch parsing of commands, we can also use argv pool in io thread, or not, by performance testing. |
|
OK, we can leave this PR |
|
maybe we don't have to leave it. maybe it's just me judging it with the wrong context (i didn't even look at the code yet). it's important to make a distinction here, when compared to redis 7.4, this change is a plain optimization, not directly related to any other change. i.e. if it's an optimization, then i don't like to keep these caching attempts in random places, and instead i'd like to understand why are our allocations so costly. maybe we need to re-group, define the context and start over. |
While profiling command execution, I noticed that command argv object alloc/free overhead is quite high for workloads with many small arguments (e.g. `HSET` with many fields). The effect is much more visible with pipelining when Redis becomes CPU bound. I experimented with replacing argv object alloc/free with a simple object pool and saw significant speedups. (Note: related effort around this topic: #13726) In this PR, I tried to improve the main hotspots in the memory allocation path (focusing on command arg allocations) to close the gap with custom pool performance, so we can avoid having a dedicated memory pools and let the whole codebase benefit from these optimizations. ## Changes ### 1) Faster dealloc via passing size hint to jemalloc (separate PR #15071) Jemalloc does more work than an object pool on free (a lookup on a tree to find the allocation's size class). For some deallocations, we can reduce free path overhead by passing a size hint to jemalloc (i.e. `sdallocx()`) which can skip metadata lookup in the common case. This PR introduces `zfree_with_size()` and uses it where we can know the allocation size i.e. `OBJ_ENCODING_EMBSTR` objects in `decrRefCount()` and SDS free path. ### 2) Reduce atomic operation cost for stat updates `update_zmalloc_stat_alloc()` / `update_zmalloc_stat_free()` previously used atomic read-modify-write (RMW) operations (`atomicIncrGet` / `atomicDecr`) which can emit expensive locked instructions on x86. When we can guarantee a single writer to a counter, we can use a cheaper load+add+store sequence instead of a locked RMW. This PR gives the first 16 threads dedicated slots for used_memory stats (intended to cover the main thread/ I/O threads) so they can use this single writer fast path. Threads beyond that fall back to a shared pool and continue to use full atomic RMW. ### 3) Improve jemalloc tcache hit rate With the default `lookahead=16` config, a pipelined HSET with ~20 fields does ~40 small allocations per command (fields + values), so you can get 16 x 40 = ~640 allocations. When args are small, many of these land in the 32 byte size class (often `EMBSTR`). Jemalloc’s default per-bin tcache cap is 200, so this kind of burst overflows the cache and it does frequent flushes. I raised the small-bin tcache limits (lg_tcache_nslots_mul:3, tcache_nslots_small_max:1000) to handle these bursts better. In the worst case, tcache may have a higher memory usage due to this change. Perhaps, another option was lowering `lookahead` to tune it differently. ### 4) Inlining When you have a simple pool, it has a few small functions and it is easy for compiler to inline them. Compared to that, jemalloc alloc/free path has a deeper call stack. Also, jemalloc was not compiled with `-flto` which was preventing inlining jemalloc functions. As part of this PR, I added `-flto` flag to jemalloc when it is enabled for Redis. Compiler also chooses not to inline some hot path functions in Redis. This suggests PGO (profile-guided optimization) could provide additional wins and perhaps we can start experimenting with it sometime. We could try to force inlining with attributes like `always_inline` but it is hard to apply across a deep call stack and misuse can cause code bloat. So, rather than going in this direction, I added `inline` keyword to some functions for now. This doesn't make compiler to inline all hot path functions but at least it is a step ahead. (If we can further improve this in future, performance gets very close to custom memory pool implementation). ## Benchmark results Commands were like: ``` memtier_benchmark --command="HSET __key__ username john_doe email [email protected] password hashed_pwd_123 created_at 1709125200 updated_at 1709125200 first_name John last_name Doe phone_number +1234567890 address 123_Main_St city NewYork country USA postal_code 10001 company Acme_Corp job_title Engineer bio Loves_coding" --command-ratio=1 --command-key-pattern=P --key-prefix="hsetkey" --key-minimum=1 --key-maximum=100000 -n 1000000 -c 50 -t 2 --hide-histogram --pipeline 50 ``` | Benchmark | Improvement | | --- | ---: | | SET | +0% | | SET (pipeline) | +8% | | HSET 15 fields | +2% | | HSET 15 fields (pipeline) | +17% | | ZADD 15 elements| +3% | | ZADD 15 elements (pipeline) | +15% |





Introduction
From flame graph, we can see redis costs much CPU to create argv objects when parsing requests and free argv objects after command processed. For most commands, argv objects just are temporary variables and their size is a little small, so we create string object pools for client argv object to avoid memory allocation and deallocation, and these string objects use EMBSTR encoding to avoid memory waste.
As the header of string object with EMBSTR encoding has the fixed size of 20 bytes, and memory allocators are typically aligned to 8 bytes, the size of string object with EMBSTR encoding will be 24, 32, 40, 48, 56, or 64 bytes, so the significant length for storing data is 4, 12, 20, 28, 36, or 44 bytes. We create pools of string objects with different sizes, now we have 6 pools to cover all EMBSTR encoding string object with different sizes. Max total memory of all pool is 256*(24+32+40+48+56+64)= 67,584, init memory is 16*(24+32+40+48+56+64)=4,224.
For the implementation of client argv object pool, we adopt the circle queue method, we get an object from the object queue when allocing a new object for client argv, and try to return an object to the object queue when freeing a client argv object.
Testing
Single Thread
About 10% improvement for
hsetpipeline mode, 2.7% forset,getpipeline modeMulti IO Thread
About 13.6% improvement for
hsetpipeline mode, 3.85% forset,getpipeline mode