Skip to content

Client argv object pool#13726

Open
ShooterIT wants to merge 5 commits into
redis:unstablefrom
ShooterIT:argv-ojb-pool-v2
Open

Client argv object pool#13726
ShooterIT wants to merge 5 commits into
redis:unstablefrom
ShooterIT:argv-ojb-pool-v2

Conversation

@ShooterIT

@ShooterIT ShooterIT commented Jan 6, 2025

Copy link
Copy Markdown
Member

Introduction

From flame graph, we can see redis costs much CPU to create argv objects when parsing requests and free argv objects after command processed. For most commands, argv objects just are temporary variables and their size is a little small, so we create string object pools for client argv object to avoid memory allocation and deallocation, and these string objects use EMBSTR encoding to avoid memory waste.

As the header of string object with EMBSTR encoding has the fixed size of 20 bytes, and memory allocators are typically aligned to 8 bytes, the size of string object with EMBSTR encoding will be 24, 32, 40, 48, 56, or 64 bytes, so the significant length for storing data is 4, 12, 20, 28, 36, or 44 bytes. We create pools of string objects with different sizes, now we have 6 pools to cover all EMBSTR encoding string object with different sizes. Max total memory of all pool is 256*(24+32+40+48+56+64)= 67,584, init memory is 16*(24+32+40+48+56+64)=4,224.

For the implementation of client argv object pool, we adopt the circle queue method, we get an object from the object queue when allocing a new object for client argv, and try to return an object to the object queue when freeing a client argv object.

Testing

Single Thread
About 10% improvement for hset pipeline mode, 2.7% for set,get pipeline mode

memtier_benchmark --test-time 180 "--pipeline" "10" "--data-size" "10" --command "HSET __key__ field1 __data__ field2 __data__ field3 __data__ field4 __data__ field5
 __data__" --command-key-pattern="P" --key-minimum=1 --key-maximum 10000000 -c 50 -t 4 --hide-histogram

Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      433628.00         4.60237         3.57500         9.94300        83.39100     78717.03 
Totals     433628.00         4.60237         3.57500         9.94300        83.39100    157434.06 

Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      479008.93         4.16619         3.30300         9.27100        82.30300     86956.29 
Totals     479008.93         4.16619         3.30300         9.27100        82.30300    173912.57



memtier_benchmark --test-time 180 "--pipeline" "10" --ratio 1:1 --key-pattern P:P --key-minimum=1 --key-maximum 3000000  -c 50 -t 4 --hide-histogram

Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets       415337.97          ---          ---         2.40386         2.27900         4.70300        33.47100     31892.46 
Gets       415337.97    415337.97         0.00         2.40386         2.27900         4.70300        33.47100     29864.45 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     830675.94    415337.97         0.00         2.40386         2.27900         4.70300        33.47100     61756.91

Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets       426810.15          ---          ---         2.33951         2.23900         4.56700        32.38300     32772.98 
Gets       426810.15    426810.15         0.00         2.33951         2.23900         4.56700        32.38300     30688.94 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     853620.30    426810.15         0.00         2.33951         2.23900         4.56700        32.38300     63461.92 

Multi IO Thread
About 13.6% improvement for hset pipeline mode, 3.85% for set,get pipeline mode

memtier_benchmark --test-time 180 "--pipeline" "10" "--data-size" "10" --command "HSET __key__ field1 __data__ field2 __data__ field3 __data__ field4 __data__ field5
 __data__" --command-key-pattern="P" --key-minimum=1 --key-maximum 10000000 -c 50 -t 4 --hide-histogram

Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      526326.80         3.79103         3.67900         7.96700        80.63900     95545.21 
Totals     526326.80         3.79103         3.67900         7.96700        80.63900    191090.42

Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      598091.31         3.33615         3.14300         7.31100        79.29500    108572.84 
Totals     598091.31         3.33615         3.14300         7.31100        79.29500    217145.69 


memtier_benchmark --test-time 180 "--pipeline" "10" --ratio 1:1 --key-pattern P:P --key-minimum=1 --key-maximum 3000000  -c 50 -t 4 --hide-histogram
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets       608945.86          ---          ---         1.63880         1.43900         2.56700        28.33500     46759.47 
Gets       608945.86    608945.86         0.00         1.63880         1.43900         2.56700        28.33500     43786.10 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals    1217891.72    608945.86         0.00         1.63880         1.43900         2.56700        28.33500     90545.57

Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets       632439.93          ---          ---         1.57785         1.37500         2.49500        27.67900     48562.86 
Gets       632439.93    632439.93         0.00         1.57785         1.37500         2.49500        27.67900     45474.77 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals    1264879.86    632439.93         0.00         1.57785         1.37500         2.49500        27.67900     94037.63

Comment thread src/networking.c
@ShooterIT

Copy link
Copy Markdown
Member Author

Hi @oranagra what do you think this design, as we know, Valkey offloads freeing argv objects to io threads valkey-io/valkey#763, it needs to deliver jobs to io threads, for redis, io threads use event driven instead of busy loop, it brings unnecessary communication, and it only brings benefit for multi io threads. So we introduce argv object pools to avoid allocing and freeing argv objects, both single thread and multi io threads have improvement.

maybe after batch parsing commands in pipeline mode, the io thread also can its own argv object pools.

@oranagra

oranagra commented Jan 7, 2025

Copy link
Copy Markdown
Member

putting aside io-threads for a moment, considering jemalloc has thread-cache to keep recently freed allocations fast to re-use, i wonder why our allocations are so costly. maybe it's something we do (like memory accounting)?
before creating a mechanism to keep re-use pulls into redis, i'd rather understand that better.

@ShooterIT

Copy link
Copy Markdown
Member Author

Hi @oranagra the flame graph as bellow, memory accounting (je_malloc_usable_size) did take some time. I am not clear about the internal mechanism of jemalloc. Between argv objects parsing and release, we will also apply and release a lot of memory during command processing, not sure if this will bring impact.

create argv object part
unstable:
截屏2025-01-08 11 13 57

PR:
截屏2025-01-08 11 11 57

free argv objects part
unstable:
截屏2025-01-08 10 10 23
PR:
截屏2025-01-08 11 15 56

Comment thread src/object.c
@oranagra

oranagra commented Jan 8, 2025

Copy link
Copy Markdown
Member

i imagine that when allocating or releasing memory, jemalloc knows what's it's serving and can give that info for free.
maybe we can try to use allocm instead of using malloc_usable_size and check how much we can gain.
can you try that?

@ShooterIT

Copy link
Copy Markdown
Member Author

Hi @oranagra it seems jemalloc remove this feature from 4.0

Removed features:

  • Remove the *allocm() API, which is superseded by the *allocx() API.

je_mallocx(size_t size, int flags) can't support to return the size

@oranagra

oranagra commented Jan 9, 2025

Copy link
Copy Markdown
Member

well, i don't like to keep caching things, and i'd prefer to keep it as last resort.
before merging this, i'd like to invest some effort in a more general speedup.
considering that the tcache should have speed up re-use of objects, and considering that we use a static version of jemalloc that's entirely in our control, if the only problem is our obsession about calling malloc_usable_size, i'd like to try to invest a little bit of hacking effort to look into getting that data faster.

maybe someone can spend a day or two of hacking trying to add an allocation and release APIs to our jemalloc that provide that info.
see if it's possible and easy, and measure the performance impact?
@YaacovHazan WDYT?

@ShooterIT

Copy link
Copy Markdown
Member Author

There are two points to consider

  • calling malloc_usable_size when creating and free argv objects just costs about 1.66% in hset pipeline mode, as above
  • returning the size of memory when mallocing and freeing seems not easy, i ever tried, jemalloc has many fast paths, we might change multiple functions to support. Of course I am not an expert on jemalloc and may have misestimated the complexity

@oranagra

oranagra commented Jan 9, 2025

Copy link
Copy Markdown
Member

so you're saying this could give an improvement of only 1.66%, while the current content of the PR gives 10% improvement?
i just can't make sense of it, considering how i imagine the jemalloc tcache works.

maybe we don't understand some implications of what we measure.
maybe worth doing an experiment with CFLAGS=-DNO_MALLOC_USABLE_SIZE?

my preference was to try and find something generic that can maybe improve all allocations, rather than just argument parsing, and also avoid adding this complexity.
also note that this is memory we'll never free, and can end up not re-using, and may need to defragment.

but if the specific value here is high, and we can't find something generic, i suppose i'll be ok with it.

@ShooterIT

Copy link
Copy Markdown
Member Author

Hi @oranagra here are some updates

so you're saying this could give an improvement of only 1.66%

from flame graph, it is.

maybe worth doing an experiment with CFLAGS=-DNO_MALLOC_USABLE_SIZE?

i did, the result is strange, the performance dropped by nearly half. rtree_metadata_read costs much.

截屏2025-01-13 16 26 31

try and find something generic that can maybe improve all allocations

yes, i think so, i ever tried when i asked to look at valkey PR: valkey-io/valkey#453 , but i can't find easy way to do that, je_malloc/free have several branches, it is hard for me to safely change this part of the code. and there seems to be no good way to update memory usage through the hook function.

I added a test and modified the code to avoid calling zmalloc_size, which improved performance somewhat, but it’s still not as good as the argv pool. The updated code is as follows:

--- a/src/zmalloc.c
+++ b/src/zmalloc.c
@@ -122,9 +122,11 @@ static inline void *ztrymalloc_usable_internal(size_t size, size_t *usable) {
 
     if (!ptr) return NULL;
 #ifdef HAVE_MALLOC_SIZE
-    size = zmalloc_size(ptr);
-    update_zmalloc_stat_alloc(size);
-    if (usable) *usable = size;
+    if (usable) {
+       size = zmalloc_size(ptr);
+       update_zmalloc_stat_alloc(size);
+       *usable = size;         
+    }
     return ptr;
 #else
     size = MALLOC_MIN_SIZE(size);
@@ -423,7 +425,7 @@ void zfree(void *ptr) {
 
     if (ptr == NULL) return;
 #ifdef HAVE_MALLOC_SIZE
-    update_zmalloc_stat_free(zmalloc_size(ptr));
+    //update_zmalloc_stat_free(zmalloc_size(ptr));
     free(ptr);
 #else
     realptr = (char*)ptr-PREFIX_SIZE;

benchmark command

memtier_benchmark --test-time 180 "--pipeline" "10" "--data-size" "10" --command "HSET __key__ field1 __data__ field2 __data__ field3 __data__ field4 __data__ field5
 __data__" --command-key-pattern="P" --key-minimum=1 --key-maximum 10000000 -c 50 -t 4 --hide-histogram

unstable

==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      447361.28         4.46132         3.39900         9.58300        81.40700     81209.00 
Totals     447361.28         4.46132         3.39900         9.58300        81.40700    162417.99

unstable without updating memory usage

==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      486639.05         4.10110         3.18300         8.83900        80.31900     88343.04 
Totals     486639.05         4.10110         3.18300         8.83900        80.31900    176686.09

argv pool

==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      505491.10         3.95784         2.99900         8.69500        79.93500     91762.86 
Totals     505491.10         3.95784         2.99900         8.69500        79.93500    183525.73 

and i got the flame graph of unstable without updating memory usage, as bellow, seems createEmbeddedStringObject and zfree also cost much
perf_nosize

@oranagra

Copy link
Copy Markdown
Member

maybe worth doing an experiment with CFLAGS=-DNO_MALLOC_USABLE_SIZE?

i did, the result is strange, the performance dropped by nearly half. rtree_metadata_read costs much.

that's odd, i see je_malloc_usable_size in the flame graph, but this flag is meant to avoid using it.

I added a test and modified the code to avoid calling zmalloc_size, which improved performance somewhat, but it’s still not as good as the argv pool.

so the improvement in this PR is not just because we don't call zmalloc_size, but also because our caching is considerably more efficient for our case than relying on jemalloc tcache?

bottom line, i'm nearly ready to give up, but before, i'd still rather understand why NO_MALLOC_USABLE_SIZE caused a regression compared do your temporary code modification, and why it still uses je_mallc_usable_size

@ShooterIT

Copy link
Copy Markdown
Member Author

Hi @oranagra for compiling with CFLAGS=-DNO_MALLOC_USABLE_SIZE, there may be an error, but i don't know why, maybe the VM i bought has some errors, then I bought two other VMs(intel and amd), there is no performance degradation with CFLAGS=-DNO_MALLOC_USABLE_SIZE.

Here are test results by benchmark commands as above
unstable : 455704.01
no_malloc_usable_size : 455942.95
without_memory_update: 485220.54
this PR : 502931.14

P.S. without_memory_update the code modification is as shown above, it influences all memory allocation/freeing and atomic variables update.

From flame graph, creating argv objects and free argv objects costs much, that is the motivation i want to try. since reusing argv can improve performance, i think resuing argv objects may also can. For argv parsing, we may want zero-copy, as @yossigo suggested in io-thread internal doc, i tried a demo stringview, but it didn't bring improvement, i think argv generally is small in redis query and memcpy is well optimized. So i think avoid create/free argv objects may be right way.

for jemalloc tcache implementation, i am not familiar with it. for our implementation of argv object pool, it just has operations on an array, is very simple.

@oranagra

Copy link
Copy Markdown
Member

i'm not certain i was able to follow everything.
so the bottom line is that the regression of using NO_MALLOC_USABLE_SIZE which you observed before, is gone.
but still, it didn't provide any improvement which i was hoping to use in order to rely on tcache instead of our argv cache.
did it provide any benefit?

@ShooterIT

Copy link
Copy Markdown
Member Author

so the bottom line is that the regression of using NO_MALLOC_USABLE_SIZE which you observed before, is gone.

yes, the regression is gone

but still, it didn't provide any improvement which i was hoping to use in order to rely on tcache instead of our argv cache.
did it provide any benefit?

Basically no difference, the benefit is very small, I've run several times, the benefit usually is less than 1%.
as above, this is one of the comparisons

unstable : 455704.01
no_malloc_usable_size : 455942.95

@oranagra

Copy link
Copy Markdown
Member

Sorry. I only now bothered to open the linked valkey pr and realize why the jemalloc tcache isn't serving us here.
We keep disposing these into the main thread tcache, so when we allocate new ones from the io thread, we have to fetch them from the arena.

By using our own pool, we bypass that.
I'd rather try to somehow return them to the original tcache.
Let's look into it a bit more.

@ShooterIT

Copy link
Copy Markdown
Member Author

We keep disposing these into the main thread tcache, so when we allocate new ones from the io thread, we have to fetch them from the arena.

for single thread, this PR brings about 10% improvement for hset pipeline mode, 2.7% for set,get pipeline mode as above, all tests i shared with your are based on single thread. Even in multi-threaded mode, we only allocate and release the object of the first command of each client as you said way. Other commands in the pipeline are also allocated and released from the main thread.

for multi threads, this PR is compatible, but not well optimized, as i said, maybe after batch parsing commands in pipeline mode, the io thread also can its own argv object pools.

@oranagra

Copy link
Copy Markdown
Member

i'm sorry i'm a little bit distracted.
so the improvement you shown is with a single thread, and it doesn't really work with mult-thread and pipeline, so how is that related to the linked valkey-io/valkey#763?

AFAIU they did that change so that the jemalloc tcache is doing what it was designed to do. and what we're doing is to replace it with our own mechanism. but what you mentioned now about the benefit when using a single thread, means that for some reason it's more efficient than the tcache. i'm a bit confused.

FYI, i'm currently working on batch parsing of commands ahead of their execution.
i.e. a client has an array of pending_commands that it first parses and then executes.

@ShooterIT

ShooterIT commented Jan 20, 2025

Copy link
Copy Markdown
Member Author

mult-thread and pipeline

For this mode, this PR also can bring improvement, since now just the first command in pipeline is parsed by IO thread, the reset commands are parsed by main thread.

FYI, i'm currently working on batch parsing of commands ahead of their execution.
i.e. a client has an array of pending_commands that it first parses and then executes.

AFAIK @sundb also is working on this.
maybe after this feature, let’s revisit this issue again

@oranagra

Copy link
Copy Markdown
Member

so the improvement you shown is with a single thread, and it doesn't really work with mult-thread and pipeline, so how is that related to the linked valkey-io/valkey#763?

For this mode, this PR also can bring improvement, since now just the first command in pipeline is parsed by IO thread, the reset commands are parsed by main thread.

i must have meant multi-threaded without pipeline.
or does it also improve this case?

in any case, i must say i don't like this direction too much, what val-key did, they had to do because io-threads work against the tcache, and i think we'll have to do that too, at least for the cases in which this PR doesn't kick in (larger sizes)

if we'd want to consider this extra optimization for certain cases, we can, but at this point i don't like it because:

  1. it hides the tcache violation problem for some sizes, but leaves it for others.
  2. i don't understand why caching the full object is that much better than jemalloc tcache
  3. if we have an efficiency problems with allocations, i rather some general solution, and not start adding specific hacks to every subsystem.

@ShooterIT

Copy link
Copy Markdown
Member Author

i must have meant multi-threaded without pipeline.
or does it also improve this case?

no, this pr doesn't. since the batch parsing of commands is on the way, these related code may be changed. Initially, we want not to free argv objects after executing if the client is from io thread, we just only add a flag, and then the client comes back to io thread, we finally free argv objects. and after finishing batch parsing of commands, we can also use argv pool in io thread, or not, by performance testing.

@ShooterIT

Copy link
Copy Markdown
Member Author

OK, we can leave this PR

@oranagra

Copy link
Copy Markdown
Member

maybe we don't have to leave it. maybe it's just me judging it with the wrong context (i didn't even look at the code yet).
maybe i got mislead by either my own opinion on tcache and malloc_usable_size, or by the reference to the valkey change.

it's important to make a distinction here, when compared to redis 7.4, this change is a plain optimization, not directly related to any other change.
or is it an important fix, to counter the implications of another change we made.

i.e. if it's an optimization, then i don't like to keep these caching attempts in random places, and instead i'd like to understand why are our allocations so costly.
on the other hand, if we have a regression for abusing the tcache (like the text in that valkey PR suggests), i think we must fix it (in this PR, or another one)

maybe we need to re-group, define the context and start over.
or maybe find another reviewer, that's not so distracted 😝

tezc added a commit that referenced this pull request May 9, 2026
While profiling command execution, I noticed that command argv object
alloc/free overhead is quite high for workloads with many small
arguments (e.g. `HSET` with many fields). The effect is much more
visible with pipelining when Redis becomes CPU bound.

I experimented with replacing argv object alloc/free with a simple
object pool and saw significant speedups.
(Note: related effort around this topic:
#13726)

In this PR, I tried to improve the main hotspots in the memory
allocation path (focusing on command arg allocations) to close the gap
with custom pool performance, so we can avoid having a dedicated memory
pools and let the whole codebase benefit from these optimizations.

## Changes

### 1) Faster dealloc via passing size hint to jemalloc (separate PR
#15071)
Jemalloc does more work than an object pool on free (a lookup on a tree
to find the allocation's size class). For some deallocations, we can
reduce free path overhead by passing a size hint to jemalloc (i.e.
`sdallocx()`) which can skip metadata lookup in the common case. This PR
introduces `zfree_with_size()` and uses it where we can know the
allocation size i.e. `OBJ_ENCODING_EMBSTR` objects in `decrRefCount()`
and SDS free path.

### 2) Reduce atomic operation cost for stat updates
`update_zmalloc_stat_alloc()` / `update_zmalloc_stat_free()` previously
used atomic read-modify-write (RMW) operations (`atomicIncrGet` /
`atomicDecr`) which can emit expensive locked instructions on x86.

When we can guarantee a single writer to a counter, we can use a cheaper
load+add+store sequence instead of a locked RMW. This PR gives the first
16 threads dedicated slots for used_memory stats (intended to cover the
main thread/ I/O threads) so they can use this single writer fast path.
Threads beyond that fall back to a shared pool and continue to use full
atomic RMW.

### 3) Improve jemalloc tcache hit rate 

With the default `lookahead=16` config, a pipelined HSET with ~20 fields
does ~40 small allocations per command (fields + values), so you can get
16 x 40 = ~640 allocations. When args are small, many of these land in
the 32 byte size class (often `EMBSTR`). Jemalloc’s default per-bin
tcache cap is 200, so this kind of burst overflows the cache and it does
frequent flushes. I raised the small-bin tcache limits
(lg_tcache_nslots_mul:3, tcache_nslots_small_max:1000) to handle these
bursts better. In the worst case, tcache may have a higher memory usage
due to this change. Perhaps, another option was lowering `lookahead` to
tune it differently.

### 4) Inlining
When you have a simple pool, it has a few small functions and it is easy
for compiler to inline them. Compared to that, jemalloc alloc/free path
has a deeper call stack. Also, jemalloc was not compiled with `-flto`
which was preventing inlining jemalloc functions. As part of this PR, I
added `-flto` flag to jemalloc when it is enabled for Redis.

Compiler also chooses not to inline some hot path functions in Redis.
This suggests PGO (profile-guided optimization) could provide additional
wins and perhaps we can start experimenting with it sometime. We could
try to force inlining with attributes like `always_inline` but it is
hard to apply across a deep call stack and misuse can cause code bloat.
So, rather than going in this direction, I added `inline` keyword to
some functions for now. This doesn't make compiler to inline all hot
path functions but at least it is a step ahead. (If we can further
improve this in future, performance gets very close to custom memory
pool implementation).

## Benchmark results

Commands were like:

```
memtier_benchmark   --command="HSET __key__ username john_doe email [email protected] password hashed_pwd_123 created_at 1709125200 updated_at 1709125200 first_name John last_name Doe phone_number +1234567890 address 123_Main_St city NewYork country USA postal_code 10001 company Acme_Corp job_title Engineer bio Loves_coding"   --command-ratio=1   --command-key-pattern=P   --key-prefix="hsetkey"   --key-minimum=1   --key-maximum=100000   -n 1000000   -c 50   -t 2   --hide-histogram --pipeline 50
```

| Benchmark | Improvement |
| --- | ---: |
| SET | +0% |
| SET (pipeline) | +8% |
| HSET 15 fields | +2% |
| HSET 15 fields (pipeline) | +17% |
| ZADD 15 elements| +3% |
| ZADD 15 elements (pipeline) | +15% |
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants