Add listpack encoding for list #11303

Merged
oranagra merged 48 commits into redis:unstable from sundb:list_listpack_encoding
Nov 16, 2022
Conversation

@sundb
Collaborator

@sundb sundb commented Sep 22, 2022

Implement #11156

Description of the feature

The new listpack encoding uses the existing list-max-listpack-size config to decide when to convert. A listpack-encoded list can be thought of as a single quicklist node, but without the ~80 bytes of overhead (internal fragmentation included) of the quicklist and quicklistNode structs.
For example, a list key with 5 items of 10 chars each now takes 128 bytes instead of the 208 it used to take.

Conversion rules

  • Convert listpack to quicklist
    When the listpack length or size reaches the list-max-listpack-size limit, it will be converted to a quicklist.
  • Convert quicklist to listpack
    When a quicklist has only one node, and its length or size is reduced to half of the list-max-listpack-size limit, it will be converted to a listpack.
    This avoids frequent back-and-forth conversions for workloads that add and remove elements right at the size or length boundary.

Interface changes

  1. Add a list entry parameter to listTypeSetIteratorDirection
    When the list encoding is listpack, listTypeIterator->lpi points to the entry after the current one,
    so when changing direction we need to use the current entry (listTypeEntry->p) to
    update listTypeIterator->lpi to the next entry in the reverse direction.

Benchmark

HW: Cloud server, Intel Xeon Cooper Lake, 8cores, 16G mem, Ubuntu 22, GCC 11.3

Listpack VS Quicklist with one node

  • LPUSH
    Use `redis-benchmark -n 1000000 -r 10000000 -q eval 'for i=1,N,1 do redis.call("rpush", "lst:"..ARGV[1], "hellohello") end' 0 __rand_int__` to create 1M lists with N entries of 10 chars each.
| N | Unstable (quicklist) | This PR (listpack) | Difference |
|---|---|---|---|
| 10 | 71515.41 (p50=0.663) | 71694.87 (p50=0.655) | +0.25% |
| 30 | 39987.20 (p50=1.303) | 40109.10 (p50=1.303) | +0.3% |
| 50 | 27913.47 (p50=1.951) | 27824.15 (p50=1.959) | -0.3% |
  • LRANGE

Dataset preparation: use `redis-benchmark -n 600 -q RPUSH mylist hello` to create a list with 600 entries of 5 chars ("hello").

`memtier_benchmark --hide-histogram -n 50000 --pipeline=10 --command=<COMMAND>`

| Command | Unstable (quicklist) | This PR (listpack) | Difference |
|---|---|---|---|
| LRANGE mylist 0 100 | 164007.37 (p50=12.67100) | 186012.67 (p50=10.94300) | +13.4% |
| LRANGE mylist 0 300 | 65296.66 (p50=32.12700) | 74293.38 (p50=28.79900) | +13.7% |
| LRANGE mylist 0 500 | 41626.12 (p50=50.94300) | 47235.58 (p50=45.82300) | +13.4% |
| LRANGE mylist 0 600 | 34868.03 (p50=60.92700) | 39477.82 (p50=55.03900) | +13.2% |

Both are quicklist

  • LRANGE

Dataset preparation: use `redis-benchmark -q -n 1000000 RPUSH mylist hello` to create a list with 1M entries.

Without pipeline:
`memtier_benchmark --hide-histogram -n 50000 --command=<COMMAND>`

| Command | Unstable (quicklist) | This PR (quicklist) | Difference |
|---|---|---|---|
| LRANGE mylist 0 100 | 89279.56 (p50=2.20700) | 93242.59 (p50=2.04700) | +4.4% |
| LRANGE mylist 0 300 | 47908.40 (p50=3.99900) | 50459.88 (p50=3.80700) | +5.3% |
| LRANGE mylist 0 500 | 34696.12 (p50=5.69500) | 35462.93 (p50=5.50300) | +2.2% |
| LRANGE mylist 0 600 | 29447.42 (p50=6.65500) | 30381.95 (p50=6.36700) | +3.1% |

With pipeline:
`memtier_benchmark --hide-histogram -n 50000 --pipeline=10 --command=<COMMAND>`

| Command | Unstable (quicklist) | This PR (quicklist) | Difference |
|---|---|---|---|
| LRANGE mylist 0 100 | 167135.03 (p50=12.67100) | 172968.38 (p50=11.96700) | +3.4% |
| LRANGE mylist 0 300 | 65079.61 (p50=32.38300) | 68443.57 (p50=31.10300) | +5.1% |
| LRANGE mylist 0 500 | 41846.02 (p50=51.19900) | 42757.76 (p50=49.66300) | +2.1% |
| LRANGE mylist 0 600 | 34865.95 (p50=60.92700) | 36438.67 (p50=59.39100) | +4.5% |
  • Other commands
    `redis-benchmark -q -n 1000000 <COMMAND>`
| Command | Unstable (quicklist) | This PR (quicklist) | Difference |
|---|---|---|---|
| LPOP mylist | 126519.82 (p50=0.199) | 127554.27 (p50=0.199) | +0.8% |
| RPOP mylist | 127140.73 (p50=0.199) | 126363.14 (p50=0.199) | -0.6% |
| LSET mylist 250000 a | 126984.12 (p50=0.327) | 126454.23 (p50=0.327) | -0.4% |

From the benchmark results, we can see:

  1. When the list uses quicklist encoding, LRANGE performance improves by <5%.
  2. When the list uses listpack encoding, LRANGE performance improves by ~13%; the main enhancement comes from addListListpackRangeReply().

Memory usage

Quote from @zuiderkwast:
Using a modified version of debug populate to create 1M lists (key:0 to key:1000000) with 5 items of 10 chars ("hellohello") each,
`./redis-cli info memory | grep ^used_memory_human` shows memory usage down by 35.49%.

| Dataset | Unstable (quicklist) | This PR (listpack) | Difference |
|---|---|---|---|
| 1M lists with 5 items of 10 chars | 214.86M | 138.59M | -35.49% |

Note

  1. Add a conversion callback to support doing some work before conversion
    Since the quicklist iterator decompresses the current node when it is released, we can
    no longer decompress the quicklist after we convert the list.

@sundb sundb requested a review from zuiderkwast September 22, 2022 08:32
Contributor

@zuiderkwast zuiderkwast left a comment
Looks good! This is a quick review, i.e. I didn't check the correctness of everything. I assume you made sure everything is covered by tests.

oranagra pushed a commit that referenced this pull request Sep 28, 2022
…y in serveClientBlockedOnList (#11326)

Mainly fixes two minor bugs:
1. When handling BL*POP/BLMOVE commands with blocked clients, we should increment server.dirty.
2. `listPopRangeAndReplyWithKey()` in `serveClientBlockedOnList()` should not repeat the call to
   `signalModifiedKey()`, which was already called when an element was pushed onto the list.
   (this was skipped in all bpop commands other than blmpop)

Other optimization:
Add a `signal` param to `listElementsRemoved` to skip `signalModifiedKey()`, unifying all pop operations.

Unifying all pop operations also prepares us for #11303, so that we can avoid having to deal with the
conversion from quicklist to listpack in various places when the list shrinks.
@sundb sundb force-pushed the list_listpack_encoding branch from a9aa447 to 4db5be0 Compare October 9, 2022 07:58
@sundb sundb marked this pull request as ready for review October 9, 2022 11:35
Member

@oranagra oranagra left a comment
initial round of review (excluding the tests code)

p.s. maybe add an example about the memory saving to the top comment (e.g. plain case of a list of 10 small numeric items)

@sundb
Collaborator Author

sundb commented Nov 13, 2022

@oranagra @zuiderkwast
I tried the benchmark with a new HW (Intel Xeon Cooper Lake, 8cores, 16G mem).
It looks like sundb@9984443 doesn't offer a benefit: it makes the previously worst-performing case a little faster, but it is still 20% slower than the old addListRangeReply() code.
Please see the new benchmark results in the top comment; LRANGE performance drops a little.

@sundb
Collaborator Author

sundb commented Nov 13, 2022

Here is what perf shows.
There is basically no difference between 1 and 2, and when I remove the else if (o->encoding == OBJ_ENCODING_LISTPACK) branch from 2, cases 2 and 3 become basically the same, so the performance drop is unavoidable.

  1. This PR
    (perf screenshot: before3)

  2. sundb@9984443
    (perf screenshot: after2)

  3. Unstable
    (perf screenshot: unstable3)

@sundb
Collaborator Author

sundb commented Nov 14, 2022

@oranagra I re-benchmarked using memtier_benchmark with pipelining, but note that I still used --threads=1, otherwise redis-server would reach 100% CPU usage.
The results are close to redis-benchmark's, with LRANGE showing a performance degradation of <=0.5%.
BTW, I didn't test LPOP, RPOP and LSET with memtier_benchmark, because their OPS results have problems (probably a bug in memtier_benchmark).

@zuiderkwast
Contributor

redis-server 100% CPU usage is what we want, isn't it?

@oranagra
Member

redis-server 100% CPU usage is what we want, isn't it?

yes, we do want the server (CPU) to be the bottleneck; otherwise, it means something else is the bottleneck and we're not measuring the throughput of redis.

@sundb
Collaborator Author

sundb commented Nov 14, 2022

@zuiderkwast @oranagra Yes, that's what I did yesterday (100% CPU usage), and normally the new code is slower than the old code. But when I didn't use --threads=1 yesterday, the results confused me: the new code was sometimes as fast as the old code, making me wonder if the 100% usage was causing the variance.
Anyway, let me do it again.

@sundb
Collaborator Author

sundb commented Nov 15, 2022

Prepare dataset: `redis-benchmark -q -n 1000000 RPUSH mylist hello`
Benchmark: `memtier_benchmark --hide-histogram -n 50000 --pipeline=10 --command="LRANGE mylist 0 300"`

Unstable

Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges     67365.18        30.13873        31.87100        49.66300        60.15900    221107.78 
Totals      67365.18        30.13873        31.87100        49.66300        60.15900    221107.78 

This PR with the last commit

Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges     66585.07        29.82411        31.23100        52.99100        61.43900    218547.27 
Totals      66585.07        29.82411        31.23100        52.99100        61.43900    218547.27

This PR without the last commit

Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges     61896.84        32.42271        34.55900        55.80700        63.99900    203159.46 
Totals      61896.84        32.42271        34.55900        55.80700        63.99900    203159.46

The performance degradation is mainly caused by the loss of inlining in addListRangeReply().
The code in the loop in addListRangeReply() gets inlined in the old code, but not in the new one, resulting in a lot of per-element function-call overhead.
In the last commit, I split the quicklist and listpack processing into smaller functions, making it easier for the compiler to inline the iterator handling in each of them, which works as evidenced by the test results and perf.
But I'm not sure if it holds under a different compiler or platform.
@oranagra @zuiderkwast WDYT?

@oranagra
Member

@sundb so the difference between your last commit and sundb@9984443 is that you put each of these in a separate function instead of an if-else in the outer function, and that allowed for better inlining?

i don't think we wanna go deeper in that direction (being one step away from manual inlining of code, or conversion to assembly).
moving the if outside the loop should have been enough.
if it isn't then we can mark these functions as inline or convert them to macros, but in this case it's not applicable (unless we want to make two variants of each of these, one as macro and the other without).

bottom line, if what you did was enough to convince the compiler to inline, i'd be ok with that (just ask to write a comment in the code specifying that).
i'd like to check that it works on both gcc and clang (some recent version), but i don't think it'll change anything if we find out it doesn't.

@sundb
Collaborator Author

sundb commented Nov 16, 2022

  1. (Unstable + clang) (1) is slower than (Unstable + gcc) (3),
    because gcc inlines the code in the loop, but clang does not.
    So 4 will also be faster than 2.

  2. Commit a542fc0 (#11303) always helps: whether with clang or gcc, it's a little faster than the old code.

benchmark: memtier_benchmark --hide-histogram -n 50000 --pipeline=10 --command="LRANGE mylist 0 300"

clang 14.0.0

  1. Unstable
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges     65372.17        31.03445        33.02300        53.24700        60.92700    214566.27 
Totals      65372.17        31.03445        33.02300        53.24700        60.92700    214566.27
  2. This PR with a542fc0 (#11303)
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges     65429.01        30.64463        32.12700        52.73500        61.18300    214752.85 
Totals      65429.01        30.64463        32.12700        52.73500        61.18300    214752.85 

gcc 11.3

  1. Unstable
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges     66881.59        30.50183        32.12700        52.73500        61.95100    219520.54 
Totals      66881.59        30.50183        32.12700        52.73500        61.95100    219520.54
  2. This PR with a542fc0 (#11303)
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges     67100.45        30.10728        31.23100        53.50300        63.48700    220238.87 
Totals      67100.45        30.10728        31.23100        53.50300        63.48700    220238.87

@oranagra
Member

so clang didn't do any inlining in that code to begin with, and is thus unaffected by the change in this PR anyway.
and for GCC, which did inline, your recent change convinced it to keep doing so, and thus no regression either.
did i get it right? in which case i think we're good to go (i'd rather not dive into macros, and just let the compiler do what it can).

@sundb
Collaborator Author

sundb commented Nov 16, 2022

so clang didn't do any inlining in that code to begin with, and is thus unaffected by the change in this PR anyway. and for GCC, which did inline, your recent change convinced it to keep doing so, and thus no regression either. did i get it right? in which case i think we're good to go (i'd rather not dive into macros, and just let the compiler do what it can).

Yes.

@oranagra
Member

@sundb please update the top comment and let me know when ready to be merged.

@sundb
Collaborator Author

sundb commented Nov 16, 2022

@oranagra Updated, mainly the LRANGE part, which seems to bring a big performance boost when the list is listpack-encoded; I verified this boost on my local PC.

@oranagra oranagra merged commit 2168ccc into redis:unstable Nov 16, 2022
@sundb sundb deleted the list_listpack_encoding branch November 17, 2022 02:02
@oranagra
Member

@sundb wanted to take a moment to thank you again for the time you've invested in this project and others!

madolson pushed a commit to madolson/redis that referenced this pull request Apr 19, 2023
…y in serveClientBlockedOnList (redis#11326)

madolson pushed a commit to madolson/redis that referenced this pull request Apr 19, 2023
Improve memory efficiency of list keys

NickCraver added a commit to StackExchange/StackExchange.Redis that referenced this pull request May 2, 2023
I'm seeing a few issues here:

1. The internal encoding has changed 1 spot to listpack, which is correct from redis/redis#11303, but is missing in the docs at https://redis.io/commands/object-encoding/ (fixed in tests themselves)
1. The `HackyGetPerf` reliably returns 0 now, regardless of how long has passed (e.g. upping iterations tremendously)...this may legit be bugged.
1. `StreamAutoClaim_IncludesDeletedMessageId` expectations are broken, not sure what to make of this yet but it's an odd change to hit between 7.0 and 7.2 versions.
NickCraver added a commit to StackExchange/StackExchange.Redis that referenced this pull request Jun 14, 2023
I'm seeing a few issues here that we need to resolve for the upcoming Redis 7.2 release:
- [x] The internal encoding has changed 1 spot to listpack, which is correct from redis/redis#11303, but is missing in the docs at https://redis.io/commands/object-encoding/ (fixed in tests themselves)
- [x] The `HackyGetPerf` reliably returns 0 now, regardless of how long has passed (e.g. upping iterations tremendously)...this may legit be bugged.
- [x] `StreamAutoClaim_IncludesDeletedMessageId` expectations are broken, not sure what to make of this yet but it's an odd change to hit between 7.0 and 7.2 versions.

Note: no release notes because these are all test tweaks.

Co-authored-by: slorello89 <[email protected]>
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…y in serveClientBlockedOnList (redis#11326)

enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
Improve memory efficiency of list keys

sundb added a commit to sundb/redis that referenced this pull request Feb 18, 2024
After redis#11303, lists start out with (and shrink back to) listpack encoding, which caused some
list-related tests to always run with listpack encoding.
Now let them run in both encodings.