Avoid overhead of comparision function pointer calls in lpFind() #13503

sundb · 2024-08-29T09:06:25Z

In #13279 (found by @filipecosta90), for custom lookups, we introduce a comparison function for lpFind() to compare entry, but it also introduces some overhead.

To avoid the overhead of function pointer calls:

Extract the lpFindCb() method into a lpFindCbInternal() method that is easier to inline.
Use unlikely to annotate the comparison method, as can only success once.

sundb · 2024-08-29T09:45:51Z

taskset -c 10-15 ./src/redis-server --save ""

taskset -c 16-20 memtier_benchmark --pipeline 30 --test-time 60 -c 1000 -t 2 -x 3 --hide-histogram --command='HSET h a 1 b 2 c 3 d 4 e 5 f 6 g 7 h 8 i 9 j 10 k 11 l 12 m 13 n 14 o 15 p 16 q 17 r 18 s 19 t 20'

before #13279 (95cbe87):

AGGREGATED AVERAGE RESULTS (3 runs)
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      413235.66       144.76332       150.52700       210.94300       230.39900    127925.49 
Totals     413235.66       144.76332       150.52700       210.94300       230.39900    255850.99

#13279 (a25b153):

AGGREGATED AVERAGE RESULTS (3 runs)
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      383770.26       156.34110       161.79100       223.23100       240.63900    118803.88 
Totals     383770.26       156.34110       161.79100       223.23100       240.63900    237607.76

this patch with #13279: (still other regissions after #13279)

AGGREGATED AVERAGE RESULTS (3 runs)
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      409445.97       146.62460       152.57500       212.99100       232.44700    126752.32 
Totals     409445.97       146.62460       152.57500       212.99100       232.44700    253504.64

fcostaoliveira · 2024-08-30T16:59:33Z

CE Performance Automation : step 2 of 2 (benchmark) RUNNING...

This comment was automatically generated given a benchmark was triggered.

Started benchmark suite at 2024-08-30 16:57:57.043980 and took 0 seconds up until now.
Status: [--------------------------------------------------------------------------------] 0.0% completed.

In total will run 107 benchmarks.
- 107 pending.
- 0 completed:
- 0 successful.
- 0 failed.
You can check a the status in detail via the grafana link

fcostaoliveira · 2024-08-30T16:59:33Z

CE Performance Automation : step 2 of 2 (benchmark) RUNNING...

This comment was automatically generated given a benchmark was triggered.

Started benchmark suite at 2024-08-30 16:57:57.218523 and took 0 seconds up until now.
Status: [--------------------------------------------------------------------------------] 0.0% completed.

In total will run 107 benchmarks.
- 107 pending.
- 0 completed:
- 0 successful.
- 0 failed.
You can check a the status in detail via the grafana link

fcostaoliveira · 2024-08-31T01:00:40Z

Automated performance analysis summary

This comment was automatically generated given there is performance data available.

Using platform named: intel64-ubuntu22.04-redis-icx1 to do the comparison.

In summary:

Detected a total of 3 stable tests between versions.
Detected a total of 4 improvements above the improvement water line.
Detected a total of 1 regressions bellow the regression water line 1.5.

You can check a comparison in detail via the grafana link

Comparison between unstable and regression_lpFind.

Time Period from 5 months ago. (environment used: oss-standalone)

Regressions Table

Test Case	Baseline unstable (median obs. +- std.dev)	Comparison regression_lpFind (median obs. +- std.dev)	% change (higher-better)	Note
memtier_benchmark-1Mkeys-load-hash-hmset-5-fields-with-1000B-values	104924 +- 1.7% (7 datapoints)	102620	-2.2%	waterline=1.7%. REGRESSION

Regressions test regexp names: memtier_benchmark-1Mkeys-load-hash-hmset-5-fields-with-1000B-values

Improvements Table

Test Case	Baseline unstable (median obs. +- std.dev)	Comparison regression_lpFind (median obs. +- std.dev)	% change (higher-better)	Note
memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values-pipeline-10	408263 +- 0.7% (7 datapoints)	416819	2.1%	IMPROVEMENT
memtier_benchmark-1key-list-100-elements-lrange-all-elements	101125 +- 0.7% (7 datapoints)	104484	3.3%	IMPROVEMENT
memtier_benchmark-1key-list-1K-elements-lrange-all-elements	17313 +- 2.4% (7 datapoints)	18158	4.9%	waterline=2.4%. IMPROVEMENT
memtier_benchmark-2keys-set-10-100-elements-sinter	82232 +- 1.2% (7 datapoints)	87789	6.8%	IMPROVEMENT

Improvements test regexp names: memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values-pipeline-10|memtier_benchmark-1key-list-100-elements-lrange-all-elements|memtier_benchmark-1key-list-1K-elements-lrange-all-elements|memtier_benchmark-2keys-set-10-100-elements-sinter

Full Results table:

Test Case	Baseline unstable (median obs. +- std.dev)	Comparison regression_lpFind (median obs. +- std.dev)	% change (higher-better)	Note
memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values	123479 +- 3.8% (7 datapoints)	120227	-2.6%	waterline=3.8%. No Change
memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values-pipeline-10	319999 +- 1.1% (7 datapoints)	320004	0.0%	No Change
memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values-pipeline-10	408263 +- 0.7% (7 datapoints)	416819	2.1%	IMPROVEMENT
memtier_benchmark-1Mkeys-load-hash-hmset-5-fields-with-1000B-values	104924 +- 1.7% (7 datapoints)	102620	-2.2%	waterline=1.7%. REGRESSION
memtier_benchmark-1key-list-100-elements-lrange-all-elements	101125 +- 0.7% (7 datapoints)	104484	3.3%	IMPROVEMENT
memtier_benchmark-1key-list-1K-elements-lrange-all-elements	17313 +- 2.4% (7 datapoints)	18158	4.9%	waterline=2.4%. IMPROVEMENT
memtier_benchmark-1key-set-1K-elements-smembers	13542 +- 1.7% (7 datapoints)	13336	-1.5%	waterline=1.7%. No Change
memtier_benchmark-2keys-set-10-100-elements-sinter	82232 +- 1.2% (7 datapoints)	87789	6.8%	IMPROVEMENT

sundb · 2024-09-02T02:39:39Z

Regressions Table

Test Case	Baseline unstable (median obs. +- std.dev)	Comparison regression_lpFind (median obs. +- std.dev)	% change (higher-better)	Note
memtier_benchmark-1Mkeys-load-hash-hmset-5-fields-with-1000B-values	104924 +- 1.7% (7 datapoints)	102620	-2.2%	waterline=1.7%. REGRESSION

about this regression, in theory, this test is hash encoding test, the changes of lpFind() should not affect it.
i tested this PR vs unstable locally and didn't see a obvious regression.
@filipecosta90 can we trigger the benchmark again just for memtier_benchmark-1Mkeys-load-hash-hmset-5-fields-with-1000B-values.

tezc

I wish there was a way to tell compiler what we want to do here. (e.g. force inlining and constprop for lpFind())

Co-authored-by: Ozan Tezcan <[email protected]>

fcostaoliveira · 2024-09-02T07:20:18Z

CE Performance Automation : step 1 of 2 (build) DONE.

This comment was automatically generated given a benchmark was triggered.
Started building at 2024-09-05 05:39:05.483802 and took 86 seconds.
You can check each build/benchmark progress in grafana:

git hash: c430be6
git branch: sundb:regression_lpFind
commit date and time: n/a
commit summary: n/a
test filters:
- command priority lower limit: 0
- command priority upper limit: 10000
- test name regex: .*
- command group regex: .*

You can check a comparison in detail via the grafana link

fcostaoliveira · 2024-09-02T07:21:46Z

CE Performance Automation : step 2 of 2 (benchmark) FINISHED.

This comment was automatically generated given a benchmark was triggered.

Started benchmark suite at 2024-10-20 16:25:51.126566 and took 5014.17387 seconds to finish.
Status: [################################################################################] 100.0% completed.

In total will run 135 benchmarks.
- 0 pending.
- 135 completed:
- 0 successful.
- 135 failed.
You can check a the status in detail via the grafana link

sundb · 2024-09-02T07:22:45Z

@tezc changed as you suggested. cb is correctly inlined from assembly, seems good from the benchmark.

tezc · 2024-09-02T07:40:04Z

Interesting!! Having that internal function makes compiler inline it.

One problem, inline is a hint for compiler. Compilers may or may not take that into account. There is a stronger __attribute__((always_inline)) but let's see if inline will work fine first.

sundb · 2024-09-02T08:13:32Z

taskset -c 10-15 ./src/redis-server --save ""

taskset -c 16-20 memtier_benchmark --pipeline 30 --test-time 180 -c 1000 -t 2 -x 1 --hide-histogram --command='HSET h a 1 b 2 c 3 d 4 e 5 f 6 g 7 h 8 i 9 j 10 k 11 l 12 m 13 n 14 o 15 p 16 q 17 r 18 s 19 t 20'

before #13279 (95cbe87):

2         Threads
1000      Connections per thread
180       Seconds


ALL STATS
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      415770.56       144.24132       149.50300       209.91900       229.37500    128710.22 
Totals     415770.56       144.24132       149.50300       209.91900       229.37500    257420.45

this PR:

2         Threads
1000      Connections per thread
180       Seconds


ALL STATS
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      421393.52       142.19897       147.45500       206.84700       223.23100    130450.93 
Totals     421393.52       142.19897       147.45500       206.84700       223.23100    260901.85

@tezc faster than before regression.

tezc · 2024-09-02T08:38:22Z

I remember I checked this while adding this function. I thought it was inlining it. I remember adding inline to lpFindCmp and checking the performance. Then, I think it stopped inlining when I added further changes..

Looks like having a static inline function lpFindCbInternal() is the trick to convince compiler to inline it. Great finding!

### New Features in binary distributions - 7 new data structures: JSON, Time series, Bloom filter, Cuckoo filter, Count-min sketch, Top-k, t-digest - Redis scalable query engine (including vector search) ### Potentially breaking changes - #12272 `GETRANGE` returns an empty bulk when the negative end index is out of range - #12395 Optimize `SCAN` command when matching data type ### Bug fixes - #13510 Fix `RM_RdbLoad` to enable AOF after RDB loading is completed - #13489 `ACL CAT` - return module commands - #13476 Fix a race condition in the `cache_memory` of `functionsLibCtx` - #13473 Fix incorrect lag due to trimming stream via `XTRIM` command - #13338 Fix incorrect lag field in `XINFO` when tombstone is after the `last_id` of the consume group - #13470 On `HDEL` of last field - update the global hash field expiration data structure - #13465 Cluster: Pass extensions to node if extension processing is handled by it - #13443 Cluster: Ensure validity of myself when loading cluster config - #13422 Cluster: Fix `CLUSTER SHARDS` command returns empty array ### Modules API - #13509 New API calls: `RM_DefragAllocRaw`, `RM_DefragFreeRaw`, and `RM_RegisterDefragCallbacks` - defrag API to allocate and free raw memory ### Performance and resource utilization improvements - #13503 Avoid overhead of comparison function pointer calls in listpack `lpFind` - #13505 Optimize `STRING` datatype write commands - #13499 Optimize `SMEMBERS` command - #13494 Optimize `GEO*` commands reply - #13490 Optimize `HELLO` command - #13488 Optimize client query buffer - #12395 Optimize `SCAN` command when matching data type - #13529 Optimize `LREM`, `LPOS`, `LINSERT`, and `LINDEX` commands - #13516 Optimize `LRANGE` and other commands that perform several writes to client buffers per call - #13431 Avoid `used_memory` contention when updating from multiple threads ### Other general improvements - #13495 Reply `-LOADING` on replica while flushing the db ### CLI tools - #13411 redis-cli: Fix wrong `dbnum` showed after the client reconnected ### Notes - No backward compatibility for replication or persistence. - Additional distributions, upgrade paths, features, and improvements will be introduced in upcoming pre-releases. - With the GA release of 8.0 we will deprecate Redis Stack.

Avoid overhead of comparision function pointer calls in lpFind()

600a0b4

sundb added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Aug 29, 2024

sundb requested a review from tezc September 2, 2024 04:32

tezc previously approved these changes Sep 2, 2024

View reviewed changes

Suggest the compiler to inline lpFind()

c900996

Co-authored-by: Ozan Tezcan <[email protected]>

sundb dismissed tezc’s stale review via c900996 September 2, 2024 07:20

Add unlikely for the comparison of entry

c430be6

sundb mentioned this pull request Sep 2, 2024

Improve performance by using a static argv array in client structure #12730

Open

tezc approved these changes Sep 3, 2024

View reviewed changes

sundb merged commit de7f2f8 into redis:unstable Sep 3, 2024
14 checks passed

sundb deleted the regression_lpFind branch September 3, 2024 14:53

YaacovHazan added the release-notes indication that this issue needs to be mentioned in the release notes label Sep 5, 2024

YaacovHazan mentioned this pull request Sep 11, 2024

Redis 8.0 M01 #13538

Merged

sundb mentioned this pull request Mar 24, 2025

[BUG] Performance degradation of HMSET command after upgrading to 7.4.2 #13876

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid overhead of comparision function pointer calls in lpFind() #13503

Avoid overhead of comparision function pointer calls in lpFind() #13503

sundb commented Aug 29, 2024 •

edited

Loading

sundb commented Aug 29, 2024

fcostaoliveira commented Aug 30, 2024

fcostaoliveira commented Aug 30, 2024

fcostaoliveira commented Aug 31, 2024

sundb commented Sep 2, 2024 •

edited

Loading

tezc left a comment

fcostaoliveira commented Sep 2, 2024 •

edited

Loading

fcostaoliveira commented Sep 2, 2024 •

edited

Loading

sundb commented Sep 2, 2024

tezc commented Sep 2, 2024

sundb commented Sep 2, 2024

tezc commented Sep 2, 2024

Avoid overhead of comparision function pointer calls in lpFind() #13503

Avoid overhead of comparision function pointer calls in lpFind() #13503

Conversation

sundb commented Aug 29, 2024 • edited Loading

sundb commented Aug 29, 2024

fcostaoliveira commented Aug 30, 2024

CE Performance Automation : step 2 of 2 (benchmark) RUNNING...

fcostaoliveira commented Aug 30, 2024

CE Performance Automation : step 2 of 2 (benchmark) RUNNING...

fcostaoliveira commented Aug 31, 2024

Automated performance analysis summary

Comparison between unstable and regression_lpFind.

Regressions Table

Improvements Table

sundb commented Sep 2, 2024 • edited Loading

Regressions Table

tezc left a comment

Choose a reason for hiding this comment

fcostaoliveira commented Sep 2, 2024 • edited Loading

CE Performance Automation : step 1 of 2 (build) DONE.

fcostaoliveira commented Sep 2, 2024 • edited Loading

CE Performance Automation : step 2 of 2 (benchmark) FINISHED.

sundb commented Sep 2, 2024

tezc commented Sep 2, 2024

sundb commented Sep 2, 2024

tezc commented Sep 2, 2024

sundb commented Aug 29, 2024 •

edited

Loading

sundb commented Sep 2, 2024 •

edited

Loading

fcostaoliveira commented Sep 2, 2024 •

edited

Loading

fcostaoliveira commented Sep 2, 2024 •

edited

Loading