
reduce getNodeByQuery CPU time by using less cache lines (from 2064 Bytes struct to 64 Bytes): reduces LLC misses and Memory Loads #13296

Merged

sundb merged 1 commit into redis:unstable from filipecosta90:cacheline.getKeysResult on Jun 18, 2024
Conversation

@fcostaoliveira (Collaborator) commented May 28, 2024

The following PR takes the getKeysResult struct from 33 cache lines (by default it embeds a 256-entry static buffer):

```
root@hpe10:~/redis# pahole -p   ./src/server.o -C getKeysResult
typedef struct {
	keyReference               keysbuf[256];         /*     0  2048 */
	/* --- cacheline 32 boundary (2048 bytes) --- */
	/* typedef keyReference */ struct {
		int                pos;
		int                flags;
	} *keys; /*  2048     8 */
	int                        numkeys;              /*  2056     4 */
	int                        size;                 /*  2060     4 */

	/* size: 2064, cachelines: 33, members: 4 */
	/* last cacheline: 16 bytes */
} getKeysResult;
```

to 1 cache line (with a static buffer of 6 keys per command):

```
root@hpe10:~/redis# pahole -p   ./src/server.o -C getKeysResult
typedef struct {
	int                        numkeys;              /*     0     4 */
	int                        size;                 /*     4     4 */
	keyReference               keysbuf[6];           /*     8    48 */
	/* typedef keyReference */ struct {
		int                pos;
		int                flags;
	} *keys; /*    56     8 */

	/* size: 64, cachelines: 1, members: 4 */
} getKeysResult;
```

With this change we get around 1.5% higher ops/sec, and profiling confirms around 15% fewer LLC loads on getNodeByQuery and 37% fewer stores.

| Function / Call Stack | CPU Time: Difference | CPU Time: 9462436 | CPU Time: this PR | Loads: Difference | Loads: 9462436 | Loads: this PR | Stores: Difference | Stores: 9462436 | Stores: this PR |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| getNodeByQuery | 0.753767 | 1.57118 | 0.817416 | 144297829 (15% less loads) | 920575969 | 776278140 | 367607824 (37% less stores) | 991642384 | 624034560 |

## results on client side

### baseline

```
taskset -c 2,3 memtier_benchmark -s 192.168.1.200 --port 6379 --authenticate perf --cluster-mode --pipeline 10 --data-size 100 --ratio 1:0 --key-pattern P:P --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 25 -t 2 --hide-histogram
Writing results to stdout
[RUN #1] Preparing benchmark client...
[RUN #1] Launching threads now...
[RUN #1 100%, 180 secs]  0 threads:   110333450 ops,  604992 (avg:  612942) ops/sec, 84.75MB/sec (avg: 85.86MB/sec),  0.82 (avg:  0.81) msec latency

2         Threads
25        Connections per thread
180       Seconds


ALL STATS
======================================================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    MOVED/sec      ASK/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
------------------------------------------------------------------------------------------------------------------------------------------------------
Sets       612942.14          ---          ---         0.00         0.00         0.81332         0.80700         1.26300         2.92700     87924.12 
Gets            0.00         0.00         0.00         0.00         0.00             ---             ---             ---             ---         0.00 
Waits           0.00          ---          ---          ---          ---             ---             ---             ---             ---          --- 
Totals     612942.14         0.00         0.00         0.00         0.00         0.81332         0.80700         1.26300         2.92700     87924.12 
```

### comparison

```
taskset -c 2,3 memtier_benchmark -s 192.168.1.200 --port 6379 --authenticate perf --cluster-mode --pipeline 10 --data-size 100 --ratio 1:0 --key-pattern P:P --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 25 -t 2 --hide-histogram
Writing results to stdout
[RUN #1] Preparing benchmark client...
[RUN #1] Launching threads now...
[RUN #1 100%, 180 secs]  0 threads:   111731310 ops,  610195 (avg:  620707) ops/sec, 85.48MB/sec (avg: 86.95MB/sec),  0.82 (avg:  0.80) msec latency

2         Threads
25        Connections per thread
180       Seconds


ALL STATS
======================================================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    MOVED/sec      ASK/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
------------------------------------------------------------------------------------------------------------------------------------------------------
Sets       620707.72          ---          ---         0.00         0.00         0.80312         0.79900         1.23900         2.87900     89037.78 
Gets            0.00         0.00         0.00         0.00         0.00             ---             ---             ---             ---         0.00 
Waits           0.00          ---          ---          ---          ---             ---             ---             ---             ---          --- 
Totals     620707.72         0.00         0.00         0.00         0.00         0.80312         0.79900         1.23900         2.87900     89037.78
```

@filipecosta90 force-pushed the cacheline.getKeysResult branch from 04ce2f5 to 9462436 on May 28, 2024 00:28
@sundb (Collaborator) left a comment

LGTM, my only concern is whether it might affect commands like `command getkeys`:
when there are more than 6 keys we will allocate memory for them each time.

@sundb merged commit 24c85cc into redis:unstable on Jun 18, 2024
@sundb (Collaborator) commented Jun 26, 2024

I ran a benchmark with more than 6 keys.
As the report below shows, we still benefit from this PR when the number of keys is greater than 6.
@filipecosta90 WDYT?

7 keys

```
./src/redis-benchmark -n 20000000 -P 100 command getkeys mget a b c d e f g
```

#13296 (commit: 24c85cc36807a354361bdf273dad4f48eaa5ffca)

```
  throughput summary: 2039775.50 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.136     0.512     2.111     2.719     2.935     4.215
```

before #13296 (4aa25d042cff5ba63f244a752887bf4be369c937)

```
  throughput summary: 1812086.62 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.445     0.480     2.423     3.039     3.215    11.399
```


100 keys

```
./src/redis-benchmark -n 3000000 -P 100 command getkeys mget key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 key17 key18 key19 key20 key21 key22 key23 key24 key25 key26 key27 key28 key29 key30 key31 key32 key33 key34 key35 key36 key37 key38 key39 key40 key41 key42 key43 key44 key45 key46 key47 key48 key49 key50 key51 key52 key53 key54 key55 key56 key57 key58 key59 key60 key61 key62 key63 key64 key65 key66 key67 key68 key69 key70 key71 key72 key73 key74 key75 key76 key77 key78 key79 key80 key81 key82 key83 key84 key85 key86 key87 key88 key89 key90 key91 key92 key93 key94 key95 key96 key97 key98 key99 key100
```

#13296 (24c85cc36807a354361bdf273dad4f48eaa5ffca)

```
  throughput summary: 221713.09 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        4.230     1.016     4.407     6.223     6.647    16.159
```

before #13296 (4aa25d042cff5ba63f244a752887bf4be369c937)

```
  throughput summary: 202565.83 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        6.792     1.144     7.895     9.631    10.119    14.327
```

250 keys

```
./src/redis-benchmark -n 1000000 -P 100 command getkeys mget key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 key17 key18 key19 key20 key21 key22 key23 key24 key25 key26 key27 key28 key29 key30 key31 key32 key33 key34 key35 key36 key37 key38 key39 key40 key41 key42 key43 key44 key45 key46 key47 key48 key49 key50 key51 key52 key53 key54 key55 key56 key57 key58 key59 key60 key61 key62 key63 key64 key65 key66 key67 key68 key69 key70 key71 key72 key73 key74 key75 key76 key77 key78 key79 key80 key81 key82 key83 key84 key85 key86 key87 key88 key89 key90 key91 key92 key93 key94 key95 key96 key97 key98 key99 key100 key101 key102 key103 key104 key105 key106 key107 key108 key109 key110 key111 key112 key113 key114 key115 key116 key117 key118 key119 key120 key121 key122 key123 key124 key125 key126 key127 key128 key129 key130 key131 key132 key133 key134 key135 key136 key137 key138 key139 key140 key141 key142 key143 key144 key145 key146 key147 key148 key149 key150 key151 key152 key153 key154 key155 key156 key157 key158 key159 key160 key161 key162 key163 key164 key165 key166 key167 key168 key169 key170 key171 key172 key173 key174 key175 key176 key177 key178 key179 key180 key181 key182 key183 key184 key185 key186 key187 key188 key189 key190 key191 key192 key193 key194 key195 key196 key197 key198 key199 key200 key201 key202 key203 key204 key205 key206 key207 key208 key209 key210 key211 key212 key213 key214 key215 key216 key217 key218 key219 key220 key221 key222 key223 key224 key225 key226 key227 key228 key229 key230 key231 key232 key233 key234 key235 key236 key237 key238 key239 key240 key241 key242 key243 key244 key245 key246 key247 key248 key249 key250
```

#13296 (24c85cc36807a354361bdf273dad4f48eaa5ffca)

```
  throughput summary: 84253.09 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        4.115     2.848     4.015     4.863     6.711    10.447
```

before #13296 (4aa25d042cff5ba63f244a752887bf4be369c937)

```
  throughput summary: 69352.94 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        8.216     3.032     8.623    10.463    11.887    15.239
```

@fcostaoliveira (Collaborator, Author)

@sundb taking your first example (`./src/redis-benchmark -n 20000000 -P 100 command getkeys mget a b c d e f g`), we can see that even when we surpass 6 keys, the overall decrRefCount/sdslen cost before this PR is worse than after it. TL;DR: the price we pay for the large struct before this change is far larger than the added allocation overhead in these scenarios.

Out of a 15-second collection, there is a 1-second difference in just the two functions below.

| Function / Call Stack | CPU Time: Difference | CPU Time: 4aa25d0 | CPU Time: 24c85cc | Clockticks: Difference | CPI Rate: Difference | CPI Rate: 4aa25d0 | CPI Rate: 24c85cc | Bad Speculation: Difference | Bad Speculation: 4aa25d0 | Bad Speculation: 24c85cc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| zfree->decrRefCount | 388.009ms | 1.021s | 632.883ms | 1,088,010,000 | 0.192 | 0.486 | 0.294 | 11.30% | 11.30% | 0.00% |
| sdslen | 322.556ms | 0.323s | 0.200ms | 873,810,000 | 0.526 | 0.526 | | -64.30% | 35.70% | 100.00% |

@YaacovHazan mentioned this pull request Jun 27, 2024
YaacovHazan added a commit that referenced this pull request Jun 27, 2024
Upgrade urgency LOW: This is the second Release Candidate for Redis 7.4.

Performance and resource utilization improvements
=================================================
* #13296 Optimize CPU cache efficiency

Changes to new 7.4 features (compared to 7.4 RC1)
=====================================================
* #13343 Hash - expiration of individual fields: when key does not exist
- reply with an array (nonexisting code for each field)
* #13329 Hash - expiration of individual fields: new keyspace event:
`hexpired`

Modules API - Potentially breaking changes to new 7.4 features (compared to 7.4 RC1)
====================================================================================
* #13326 Hash - expiration of individual fields: avoid lazy expire when
called from a Modules API function
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025