Reduce getNodeByQuery CPU time by using fewer cache lines (from a 2064-byte struct to 64 bytes): reduces LLC misses and memory loads (#13296)
sundb left a comment:
LGTM, the only concern is whether it might affect commands like `COMMAND GETKEYS`:
when there are more than 6 arguments we will allocate memory for it each time.
A benchmark test was done when the number of keys was greater than 6.
@sundb taking your first example (`./src/redis-benchmark -n 20000000 -P 100 command getkeys mget a b c d e f g`): out of a collection of 15 secs, there is a 1 sec difference just on the below 2 functions.
Upgrade urgency LOW: This is the second Release Candidate for Redis 7.4.

Performance and resource utilization improvements
=================================================
* #13296 Optimize CPU cache efficiency

Changes to new 7.4 features (compared to 7.4 RC1)
=================================================
* #13343 Hash - expiration of individual fields: when key does not exist - reply with an array (nonexisting code for each field)
* #13329 Hash - expiration of individual fields: new keyspace event: `hexpired`

Modules API - Potentially breaking changes to new 7.4 features (compared to 7.4 RC1)
====================================================================================
* #13326 Hash - expiration of individual fields: avoid lazy expire when called from a Modules API function
Reduce getNodeByQuery CPU time by using fewer cache lines (from a 2064-byte struct to 64 bytes): reduces LLC misses and memory loads (redis#13296)

The following PR goes from 33 cachelines on the getKeysResult struct (which by default has a 256-entry static buffer):

```
root@hpe10:~/redis# pahole -p ./src/server.o -C getKeysResult
typedef struct {
	keyReference keysbuf[256];          /*    0  2048 */
	/* --- cacheline 32 boundary (2048 bytes) --- */
	/* typedef keyReference */
	struct {
		int pos;
		int flags;
	} *keys;                            /* 2048     8 */
	int numkeys;                        /* 2056     4 */
	int size;                           /* 2060     4 */

	/* size: 2064, cachelines: 33, members: 4 */
	/* last cacheline: 16 bytes */
} getKeysResult;
```

to 1 cacheline (with a static buffer of 6 keys per command):

```
root@hpe10:~/redis# pahole -p ./src/server.o -C getKeysResult
typedef struct {
	int numkeys;                        /*    0     4 */
	int size;                           /*    4     4 */
	keyReference keysbuf[6];            /*    8    48 */
	/* typedef keyReference */
	struct {
		int pos;
		int flags;
	} *keys;                            /*   56     8 */

	/* size: 64, cachelines: 1, members: 4 */
} getKeysResult;
```

We get around 1.5% higher ops/sec, and a confirmation of around 15% fewer LLC loads on getNodeByQuery and 37% fewer stores.

| Function / Call Stack | CPU Time: Difference | CPU Time: 9462436 | CPU Time: this PR | Loads: Difference | Loads: 9462436 | Loads: this PR | Stores: Difference | Stores: 9462436 | Stores: this PR |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| getNodeByQuery | 0.753767 | 1.57118 | 0.817416 | 144297829 (15% less loads) | 920575969 | 776278140 | 367607824 (37% less stores) | 991642384 | 624034560 |

## results on client side

### baseline

```
taskset -c 2,3 memtier_benchmark -s 192.168.1.200 --port 6379 --authenticate perf --cluster-mode --pipeline 10 --data-size 100 --ratio 1:0 --key-pattern P:P --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 25 -t 2 --hide-histogram
Writing results to stdout
[RUN redis#1] Preparing benchmark client...
[RUN redis#1] Launching threads now...
[RUN redis#1 100%, 180 secs]  0 threads: 110333450 ops, 604992 (avg: 612942) ops/sec, 84.75MB/sec (avg: 85.86MB/sec), 0.82 (avg: 0.81) msec latency

2        Threads
25       Connections per thread
180      Seconds

ALL STATS
======================================================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    MOVED/sec      ASK/sec Avg. Latency  p50 Latency  p99 Latency p99.9 Latency       KB/sec
------------------------------------------------------------------------------------------------------------------------------------------------------
Sets       612942.14          ---          ---         0.00         0.00      0.81332      0.80700      1.26300      2.92700     87924.12
Gets            0.00         0.00         0.00         0.00         0.00          ---          ---          ---          ---         0.00
Waits           0.00          ---          ---          ---          ---          ---          ---          ---          ---          ---
Totals     612942.14         0.00         0.00         0.00         0.00      0.81332      0.80700      1.26300      2.92700     87924.12
```

### comparison

```
taskset -c 2,3 memtier_benchmark -s 192.168.1.200 --port 6379 --authenticate perf --cluster-mode --pipeline 10 --data-size 100 --ratio 1:0 --key-pattern P:P --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 25 -t 2 --hide-histogram
Writing results to stdout
[RUN redis#1] Preparing benchmark client...
[RUN redis#1] Launching threads now...
[RUN redis#1 100%, 180 secs]  0 threads: 111731310 ops, 610195 (avg: 620707) ops/sec, 85.48MB/sec (avg: 86.95MB/sec), 0.82 (avg: 0.80) msec latency

2        Threads
25       Connections per thread
180      Seconds

ALL STATS
======================================================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    MOVED/sec      ASK/sec Avg. Latency  p50 Latency  p99 Latency p99.9 Latency       KB/sec
------------------------------------------------------------------------------------------------------------------------------------------------------
Sets       620707.72          ---          ---         0.00         0.00      0.80312      0.79900      1.23900      2.87900     89037.78
Gets            0.00         0.00         0.00         0.00         0.00          ---          ---          ---          ---         0.00
Waits           0.00          ---          ---          ---          ---          ---          ---          ---          ---          ---
Totals     620707.72         0.00         0.00         0.00         0.00      0.80312      0.79900      1.23900      2.87900     89037.78
```

Co-authored-by: filipecosta90 <[email protected]>