
reduce getNodeByQuery CPU time by using less cache lines (from 2064 Bytes struct to 64 Bytes): reduces LLC misses and Memory Loads #13296

Merged

sundb merged 1 commit into redis:unstable from filipecosta90:cacheline.getKeysResult on Jun 18, 2024
Conversation

@fcostaoliveira (Collaborator) commented May 28, 2024

The following PR takes the getKeysResult struct from 33 cache lines (by default it embeds a 256-entry static buffer):

```
root@hpe10:~/redis# pahole -p   ./src/server.o -C getKeysResult
typedef struct {
	keyReference               keysbuf[256];         /*     0  2048 */
	/* --- cacheline 32 boundary (2048 bytes) --- */
	/* typedef keyReference */ struct {
		int                pos;
		int                flags;
	} *keys; /*  2048     8 */
	int                        numkeys;              /*  2056     4 */
	int                        size;                 /*  2060     4 */

	/* size: 2064, cachelines: 33, members: 4 */
	/* last cacheline: 16 bytes */
} getKeysResult;
```

to 1 cache line (with a static buffer of 6 keys per command):

```
root@hpe10:~/redis# pahole -p   ./src/server.o -C getKeysResult
typedef struct {
	int                        numkeys;              /*     0     4 */
	int                        size;                 /*     4     4 */
	keyReference               keysbuf[6];           /*     8    48 */
	/* typedef keyReference */ struct {
		int                pos;
		int                flags;
	} *keys; /*    56     8 */

	/* size: 64, cachelines: 1, members: 4 */
} getKeysResult;
```

With this change we get around 1.5% higher ops/sec, and profiling confirms around 15% fewer LLC loads on getNodeByQuery and 37% fewer stores.

| Function / Call Stack | CPU Time: Difference | CPU Time: 9462436 | CPU Time: this PR | Loads: Difference | Loads: 9462436 | Loads: this PR | Stores: Difference | Stores: 9462436 | Stores: this PR |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| getNodeByQuery | 0.753767 | 1.57118 | 0.817416 | 144297829 (15% less loads) | 920575969 | 776278140 | 367607824 (37% less stores) | 991642384 | 624034560 |

## results on client side

### baseline

```
taskset -c 2,3 memtier_benchmark -s 192.168.1.200 --port 6379 --authenticate perf --cluster-mode --pipeline 10 --data-size 100 --ratio 1:0 --key-pattern P:P --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 25 -t 2 --hide-histogram
Writing results to stdout
[RUN #1] Preparing benchmark client...
[RUN #1] Launching threads now...
[RUN #1 100%, 180 secs]  0 threads:   110333450 ops,  604992 (avg:  612942) ops/sec, 84.75MB/sec (avg: 85.86MB/sec),  0.82 (avg:  0.81) msec latency

2         Threads
25        Connections per thread
180       Seconds


ALL STATS
======================================================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    MOVED/sec      ASK/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
------------------------------------------------------------------------------------------------------------------------------------------------------
Sets       612942.14          ---          ---         0.00         0.00         0.81332         0.80700         1.26300         2.92700     87924.12 
Gets            0.00         0.00         0.00         0.00         0.00             ---             ---             ---             ---         0.00 
Waits           0.00          ---          ---          ---          ---             ---             ---             ---             ---          --- 
Totals     612942.14         0.00         0.00         0.00         0.00         0.81332         0.80700         1.26300         2.92700     87924.12 
```

### comparison

```
taskset -c 2,3 memtier_benchmark -s 192.168.1.200 --port 6379 --authenticate perf --cluster-mode --pipeline 10 --data-size 100 --ratio 1:0 --key-pattern P:P --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 25 -t 2 --hide-histogram
Writing results to stdout
[RUN #1] Preparing benchmark client...
[RUN #1] Launching threads now...
[RUN #1 100%, 180 secs]  0 threads:   111731310 ops,  610195 (avg:  620707) ops/sec, 85.48MB/sec (avg: 86.95MB/sec),  0.82 (avg:  0.80) msec latency

2         Threads
25        Connections per thread
180       Seconds


ALL STATS
======================================================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    MOVED/sec      ASK/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
------------------------------------------------------------------------------------------------------------------------------------------------------
Sets       620707.72          ---          ---         0.00         0.00         0.80312         0.79900         1.23900         2.87900     89037.78 
Gets            0.00         0.00         0.00         0.00         0.00             ---             ---             ---             ---         0.00 
Waits           0.00          ---          ---          ---          ---             ---             ---             ---             ---          --- 
Totals     620707.72         0.00         0.00         0.00         0.00         0.80312         0.79900         1.23900         2.87900     89037.78
```

@filipecosta90 force-pushed the cacheline.getKeysResult branch from 04ce2f5 to 9462436 on May 28, 2024 00:28
@sundb (Collaborator) left a comment

LGTM, my only concern is whether it might affect commands like `command getkeys`:
when there are more than 6 keys we will allocate memory for them each time.

@sundb merged commit 24c85cc into redis:unstable on Jun 18, 2024
@sundb (Collaborator) commented Jun 26, 2024

I ran a benchmark with more than 6 keys.
As the report below shows, we still benefit from this PR when the number of keys is greater than 6.
@filipecosta90 WDYT?

7 keys

```
./src/redis-benchmark -n 20000000 -P 100 command getkeys mget a b c d e f g
```

#13296 (commit: 24c85cc36807a354361bdf273dad4f48eaa5ffca)

```
  throughput summary: 2039775.50 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.136     0.512     2.111     2.719     2.935     4.215
```

before #13296 (4aa25d042cff5ba63f244a752887bf4be369c937)

```
  throughput summary: 1812086.62 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.445     0.480     2.423     3.039     3.215    11.399
```


100 keys

```
./src/redis-benchmark -n 3000000 -P 100 command getkeys mget key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 key17 key18 key19 key20 key21 key22 key23 key24 key25 key26 key27 key28 key29 key30 key31 key32 key33 key34 key35 key36 key37 key38 key39 key40 key41 key42 key43 key44 key45 key46 key47 key48 key49 key50 key51 key52 key53 key54 key55 key56 key57 key58 key59 key60 key61 key62 key63 key64 key65 key66 key67 key68 key69 key70 key71 key72 key73 key74 key75 key76 key77 key78 key79 key80 key81 key82 key83 key84 key85 key86 key87 key88 key89 key90 key91 key92 key93 key94 key95 key96 key97 key98 key99 key100
```

#13296 (24c85cc36807a354361bdf273dad4f48eaa5ffca)

```
  throughput summary: 221713.09 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        4.230     1.016     4.407     6.223     6.647    16.159
```

before #13296 (4aa25d042cff5ba63f244a752887bf4be369c937)

```
  throughput summary: 202565.83 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        6.792     1.144     7.895     9.631    10.119    14.327
```

250 keys

```
./src/redis-benchmark -n 1000000 -P 100 command getkeys mget key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 key16 key17 key18 key19 key20 key21 key22 key23 key24 key25 key26 key27 key28 key29 key30 key31 key32 key33 key34 key35 key36 key37 key38 key39 key40 key41 key42 key43 key44 key45 key46 key47 key48 key49 key50 key51 key52 key53 key54 key55 key56 key57 key58 key59 key60 key61 key62 key63 key64 key65 key66 key67 key68 key69 key70 key71 key72 key73 key74 key75 key76 key77 key78 key79 key80 key81 key82 key83 key84 key85 key86 key87 key88 key89 key90 key91 key92 key93 key94 key95 key96 key97 key98 key99 key100 key101 key102 key103 key104 key105 key106 key107 key108 key109 key110 key111 key112 key113 key114 key115 key116 key117 key118 key119 key120 key121 key122 key123 key124 key125 key126 key127 key128 key129 key130 key131 key132 key133 key134 key135 key136 key137 key138 key139 key140 key141 key142 key143 key144 key145 key146 key147 key148 key149 key150 key151 key152 key153 key154 key155 key156 key157 key158 key159 key160 key161 key162 key163 key164 key165 key166 key167 key168 key169 key170 key171 key172 key173 key174 key175 key176 key177 key178 key179 key180 key181 key182 key183 key184 key185 key186 key187 key188 key189 key190 key191 key192 key193 key194 key195 key196 key197 key198 key199 key200 key201 key202 key203 key204 key205 key206 key207 key208 key209 key210 key211 key212 key213 key214 key215 key216 key217 key218 key219 key220 key221 key222 key223 key224 key225 key226 key227 key228 key229 key230 key231 key232 key233 key234 key235 key236 key237 key238 key239 key240 key241 key242 key243 key244 key245 key246 key247 key248 key249 key250
```

#13296 (24c85cc36807a354361bdf273dad4f48eaa5ffca)

```
  throughput summary: 84253.09 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        4.115     2.848     4.015     4.863     6.711    10.447
```

before #13296 (4aa25d042cff5ba63f244a752887bf4be369c937)

```
  throughput summary: 69352.94 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        8.216     3.032     8.623    10.463    11.887    15.239
```

@fcostaoliveira (Collaborator, Author)

@sundb taking your first example (`./src/redis-benchmark -n 20000000 -P 100 command getkeys mget a b c d e f g`), we can see that even when we surpass 6 keys, the overall decrRefCount/sdslen cost before this PR is worse than after it. TL;DR: the price we pay for the large struct before this change is far larger than the added allocation overhead in these scenarios.

Out of a 15-second collection, there is a 1-second difference in just the two functions below.

| Function / Call Stack | CPU Time: Difference | CPU Time: 4aa25d0 | CPU Time: 24c85cc | Clockticks: Difference | CPI Rate: Difference | CPI Rate: 4aa25d0 | CPI Rate: 24c85cc | Bad Speculation: Difference | Bad Speculation: 4aa25d0 | Bad Speculation: 24c85cc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| zfree->decrRefCount | 388.009ms | 1.021s | 632.883ms | 1,088,010,000 | 0.192 | 0.486 | 0.294 | 11.30% | 11.30% | 0.00% |
| sdslen | 322.556ms | 0.323s | 0.200ms | 873,810,000 | 0.526 | 0.526 | | -64.30% | 35.70% | 100.00% |

@YaacovHazan mentioned this pull request Jun 27, 2024
YaacovHazan added a commit that referenced this pull request Jun 27, 2024
Upgrade urgency LOW: This is the second Release Candidate for Redis 7.4.

Performance and resource utilization improvements
=================================================
* #13296 Optimize CPU cache efficiency

Changes to new 7.4 features (compared to 7.4 RC1)
=====================================================
* #13343 Hash - expiration of individual fields: when key does not exist
- reply with an array (nonexisting code for each field)
* #13329 Hash - expiration of individual fields: new keyspace event:
`hexpired`

Modules API - Potentially breaking changes to new 7.4 features (compared to 7.4 RC1)
====================================================================================
* #13326 Hash - expiration of individual fields: avoid lazy expire when
called from a Modules API function
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025