Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid overhead of comparision function pointer calls in lpFind() #13503

Merged
merged 3 commits into from
Sep 3, 2024

Conversation

sundb
Copy link
Collaborator

@sundb sundb commented Aug 29, 2024

In #13279 (found by @filipecosta90), for custom lookups, we introduce a comparison function for lpFind() to compare entry, but it also introduces some overhead.

To avoid the overhead of function pointer calls:

  1. Extract the lpFindCb() method into a lpFindCbInternal() method that is easier to inline.
  2. Use unlikely to annotate the comparison method, as can only success once.

@sundb
Copy link
Collaborator Author

sundb commented Aug 29, 2024

taskset -c 10-15 ./src/redis-server --save ""
taskset -c 16-20 memtier_benchmark --pipeline 30 --test-time 60 -c 1000 -t 2 -x 3 --hide-histogram --command='HSET h a 1 b 2 c 3 d 4 e 5 f 6 g 7 h 8 i 9 j 10 k 11 l 12 m 13 n 14 o 15 p 16 q 17 r 18 s 19 t 20'

before #13279 (95cbe87):

AGGREGATED AVERAGE RESULTS (3 runs)
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      413235.66       144.76332       150.52700       210.94300       230.39900    127925.49 
Totals     413235.66       144.76332       150.52700       210.94300       230.39900    255850.99

#13279 (a25b153):

AGGREGATED AVERAGE RESULTS (3 runs)
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      383770.26       156.34110       161.79100       223.23100       240.63900    118803.88 
Totals     383770.26       156.34110       161.79100       223.23100       240.63900    237607.76

this patch with #13279: (still other regissions after #13279)

AGGREGATED AVERAGE RESULTS (3 runs)
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      409445.97       146.62460       152.57500       212.99100       232.44700    126752.32 
Totals     409445.97       146.62460       152.57500       212.99100       232.44700    253504.64

@sundb sundb added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Aug 29, 2024
@fcostaoliveira
Copy link
Collaborator

CE Performance Automation : step 2 of 2 (benchmark) RUNNING...

This comment was automatically generated given a benchmark was triggered.

Started benchmark suite at 2024-08-30 16:57:57.043980 and took 0 seconds up until now.
Status: [--------------------------------------------------------------------------------] 0.0% completed.

In total will run 107 benchmarks.
- 107 pending.
- 0 completed:
- 0 successful.
- 0 failed.
You can check a the status in detail via the grafana link

@fcostaoliveira
Copy link
Collaborator

CE Performance Automation : step 2 of 2 (benchmark) RUNNING...

This comment was automatically generated given a benchmark was triggered.

Started benchmark suite at 2024-08-30 16:57:57.218523 and took 0 seconds up until now.
Status: [--------------------------------------------------------------------------------] 0.0% completed.

In total will run 107 benchmarks.
- 107 pending.
- 0 completed:
- 0 successful.
- 0 failed.
You can check a the status in detail via the grafana link

@fcostaoliveira
Copy link
Collaborator

Automated performance analysis summary

This comment was automatically generated given there is performance data available.

Using platform named: intel64-ubuntu22.04-redis-icx1 to do the comparison.

In summary:

  • Detected a total of 3 stable tests between versions.
  • Detected a total of 4 improvements above the improvement water line.
  • Detected a total of 1 regressions bellow the regression water line 1.5.

You can check a comparison in detail via the grafana link

Comparison between unstable and regression_lpFind.

Time Period from 5 months ago. (environment used: oss-standalone)

Regressions Table

Test Case Baseline unstable (median obs. +- std.dev) Comparison regression_lpFind (median obs. +- std.dev) % change (higher-better) Note
memtier_benchmark-1Mkeys-load-hash-hmset-5-fields-with-1000B-values 104924 +- 1.7% (7 datapoints) 102620 -2.2% waterline=1.7%. REGRESSION

Regressions test regexp names: memtier_benchmark-1Mkeys-load-hash-hmset-5-fields-with-1000B-values

Improvements Table

Test Case Baseline unstable (median obs. +- std.dev) Comparison regression_lpFind (median obs. +- std.dev) % change (higher-better) Note
memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values-pipeline-10 408263 +- 0.7% (7 datapoints) 416819 2.1% IMPROVEMENT
memtier_benchmark-1key-list-100-elements-lrange-all-elements 101125 +- 0.7% (7 datapoints) 104484 3.3% IMPROVEMENT
memtier_benchmark-1key-list-1K-elements-lrange-all-elements 17313 +- 2.4% (7 datapoints) 18158 4.9% waterline=2.4%. IMPROVEMENT
memtier_benchmark-2keys-set-10-100-elements-sinter 82232 +- 1.2% (7 datapoints) 87789 6.8% IMPROVEMENT

Improvements test regexp names: memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values-pipeline-10|memtier_benchmark-1key-list-100-elements-lrange-all-elements|memtier_benchmark-1key-list-1K-elements-lrange-all-elements|memtier_benchmark-2keys-set-10-100-elements-sinter

Full Results table:
Test Case Baseline unstable (median obs. +- std.dev) Comparison regression_lpFind (median obs. +- std.dev) % change (higher-better) Note
memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values 123479 +- 3.8% (7 datapoints) 120227 -2.6% waterline=3.8%. No Change
memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values-pipeline-10 319999 +- 1.1% (7 datapoints) 320004 0.0% No Change
memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values-pipeline-10 408263 +- 0.7% (7 datapoints) 416819 2.1% IMPROVEMENT
memtier_benchmark-1Mkeys-load-hash-hmset-5-fields-with-1000B-values 104924 +- 1.7% (7 datapoints) 102620 -2.2% waterline=1.7%. REGRESSION
memtier_benchmark-1key-list-100-elements-lrange-all-elements 101125 +- 0.7% (7 datapoints) 104484 3.3% IMPROVEMENT
memtier_benchmark-1key-list-1K-elements-lrange-all-elements 17313 +- 2.4% (7 datapoints) 18158 4.9% waterline=2.4%. IMPROVEMENT
memtier_benchmark-1key-set-1K-elements-smembers 13542 +- 1.7% (7 datapoints) 13336 -1.5% waterline=1.7%. No Change
memtier_benchmark-2keys-set-10-100-elements-sinter 82232 +- 1.2% (7 datapoints) 87789 6.8% IMPROVEMENT

@sundb
Copy link
Collaborator Author

sundb commented Sep 2, 2024

Regressions Table

Test Case Baseline unstable (median obs. +- std.dev) Comparison regression_lpFind (median obs. +- std.dev) % change (higher-better) Note
memtier_benchmark-1Mkeys-load-hash-hmset-5-fields-with-1000B-values 104924 +- 1.7% (7 datapoints) 102620 -2.2% waterline=1.7%. REGRESSION

about this regression, in theory, this test is hash encoding test, the changes of lpFind() should not affect it.
i tested this PR vs unstable locally and didn't see a obvious regression.
@filipecosta90 can we trigger the benchmark again just for memtier_benchmark-1Mkeys-load-hash-hmset-5-fields-with-1000B-values.

@sundb sundb requested a review from tezc September 2, 2024 04:32
tezc
tezc previously approved these changes Sep 2, 2024
Copy link
Collaborator

@tezc tezc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish there was a way to tell compiler what we want to do here. (e.g. force inlining and constprop for lpFind())

@fcostaoliveira
Copy link
Collaborator

fcostaoliveira commented Sep 2, 2024

CE Performance Automation : step 1 of 2 (build) DONE.

This comment was automatically generated given a benchmark was triggered.
Started building at 2024-09-05 05:39:05.483802 and took 86 seconds.
You can check each build/benchmark progress in grafana:

  • git hash: c430be6
  • git branch: sundb:regression_lpFind
  • commit date and time: n/a
  • commit summary: n/a
  • test filters:
    • command priority lower limit: 0
    • command priority upper limit: 10000
    • test name regex: .*
    • command group regex: .*

You can check a comparison in detail via the grafana link

@fcostaoliveira
Copy link
Collaborator

fcostaoliveira commented Sep 2, 2024

CE Performance Automation : step 2 of 2 (benchmark) FINISHED.

This comment was automatically generated given a benchmark was triggered.

Started benchmark suite at 2024-10-20 16:25:51.126566 and took 5014.17387 seconds to finish.
Status: [################################################################################] 100.0% completed.

In total will run 135 benchmarks.
- 0 pending.
- 135 completed:
- 0 successful.
- 135 failed.
You can check a the status in detail via the grafana link

@sundb
Copy link
Collaborator Author

sundb commented Sep 2, 2024

@tezc changed as you suggested. cb is correctly inlined from assembly, seems good from the benchmark.

@tezc
Copy link
Collaborator

tezc commented Sep 2, 2024

Interesting!! Having that internal function makes compiler inline it.

One problem, inline is a hint for compiler. Compilers may or may not take that into account. There is a stronger __attribute__((always_inline)) but let's see if inline will work fine first.

@sundb
Copy link
Collaborator Author

sundb commented Sep 2, 2024

taskset -c 10-15 ./src/redis-server --save ""
taskset -c 16-20 memtier_benchmark --pipeline 30 --test-time 180 -c 1000 -t 2 -x 1 --hide-histogram --command='HSET h a 1 b 2 c 3 d 4 e 5 f 6 g 7 h 8 i 9 j 10 k 11 l 12 m 13 n 14 o 15 p 16 q 17 r 18 s 19 t 20'

before #13279 (95cbe87):

2         Threads
1000      Connections per thread
180       Seconds


ALL STATS
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      415770.56       144.24132       149.50300       209.91900       229.37500    128710.22 
Totals     415770.56       144.24132       149.50300       209.91900       229.37500    257420.45 

this PR:

2         Threads
1000      Connections per thread
180       Seconds


ALL STATS
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Hsets      421393.52       142.19897       147.45500       206.84700       223.23100    130450.93 
Totals     421393.52       142.19897       147.45500       206.84700       223.23100    260901.85 

@tezc faster than before regression.

@tezc
Copy link
Collaborator

tezc commented Sep 2, 2024

I remember I checked this while adding this function. I thought it was inlining it. I remember adding inline to lpFindCmp and checking the performance. Then, I think it stopped inlining when I added further changes..

Looks like having a static inline function lpFindCbInternal() is the trick to convince compiler to inline it. Great finding!

@sundb sundb merged commit de7f2f8 into redis:unstable Sep 3, 2024
14 checks passed
@sundb sundb deleted the regression_lpFind branch September 3, 2024 14:53
@YaacovHazan YaacovHazan added the release-notes indication that this issue needs to be mentioned in the release notes label Sep 5, 2024
@YaacovHazan YaacovHazan mentioned this pull request Sep 11, 2024
YaacovHazan added a commit that referenced this pull request Sep 12, 2024
### New Features in binary distributions

- 7 new data structures: JSON, Time series, Bloom filter, Cuckoo filter,
Count-min sketch, Top-k, t-digest
- Redis scalable query engine (including vector search)

### Potentially breaking changes

- #12272 `GETRANGE` returns an empty bulk when the negative end index is
out of range
- #12395 Optimize `SCAN` command when matching data type

### Bug fixes

- #13510 Fix `RM_RdbLoad` to enable AOF after RDB loading is completed
- #13489 `ACL CAT` - return module commands
- #13476 Fix a race condition in the `cache_memory` of `functionsLibCtx`
- #13473 Fix incorrect lag due to trimming stream via `XTRIM` command
- #13338 Fix incorrect lag field in `XINFO` when tombstone is after the
`last_id` of the consume group
- #13470 On `HDEL` of last field - update the global hash field
expiration data structure
- #13465 Cluster: Pass extensions to node if extension processing is
handled by it
- #13443 Cluster: Ensure validity of myself when loading cluster config
- #13422 Cluster: Fix `CLUSTER SHARDS` command returns empty array

### Modules API

- #13509 New API calls: `RM_DefragAllocRaw`, `RM_DefragFreeRaw`, and
`RM_RegisterDefragCallbacks` - defrag API to allocate and free raw
memory

### Performance and resource utilization improvements

- #13503 Avoid overhead of comparison function pointer calls in listpack
`lpFind`
- #13505 Optimize `STRING` datatype write commands
- #13499 Optimize `SMEMBERS` command
- #13494 Optimize `GEO*` commands reply
- #13490 Optimize `HELLO` command
- #13488 Optimize client query buffer
- #12395 Optimize `SCAN` command when matching data type
- #13529 Optimize `LREM`, `LPOS`, `LINSERT`, and `LINDEX` commands
- #13516 Optimize `LRANGE` and other commands that perform several
writes to client buffers per call
- #13431 Avoid `used_memory` contention when updating from multiple
threads

### Other general improvements

- #13495 Reply `-LOADING` on replica while flushing the db

### CLI tools

- #13411 redis-cli: Fix wrong `dbnum` showed after the client
reconnected

### Notes

- No backward compatibility for replication or persistence.
- Additional distributions, upgrade paths, features, and improvements
will be introduced in upcoming pre-releases.
- With the GA release of 8.0 we will deprecate Redis Stack.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action:run-benchmark Triggers the benchmark suite for this Pull Request release-notes indication that this issue needs to be mentioned in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants