
Fix random element in skewed sparse hash table #2085

Merged
zuiderkwast merged 1 commit into valkey-io:unstable from zuiderkwast:random-sparse on May 19, 2025

Conversation

@zuiderkwast
Contributor

@zuiderkwast zuiderkwast commented May 14, 2025

In cases when half of the elements in a hashtable have been deleted in scan order, i.e. by scanning and deleting the returned elements, the random selection algorithm becomes very slow. Also, if a random element is deleted repeatedly, this can cause the hash table to become skewed and make random selection very slow. Behavior without this fix:

  • The fair random algorithm samples elements by picking a random cursor and scanning from that point until it has found a number of elements, then picking a random one of them.
  • Now, assume there is an empty sequence of buckets (in scan order).
  • When we pick a random cursor and it points into this empty sequence, we continue scanning until we find some elements and then pick one of these. Therefore, the buckets just after an empty sequence are more likely to be picked than other buckets.
  • If we repeat SPOP many times, the area just after the empty section is emptied too, the empty section grows, and the problem gets worse and worse.

Bucket distribution (x = non-empty bucket, . = empty bucket)

x..xx.x.xx.x.........x.xx..x.x..x.xx..x.x
              ^      ^ ^^  ^ ^
              |      | ||  | |
            random   elements
            cursor    sampled

This PR changes the sampling to pick a new random index to scan for each new bucket chain that is sampled. This is more similar to how fair random selection works in dict.c.

Additionally, this PR includes two small optimizations/tuning:

  1. The change allows us to lower the sample sizes for fair random element selection. There are unit tests to verify the fairness of the algorithm.
  2. The sample size is decreased even further in very sparse tables, to prevent the algorithm from sampling empty buckets many times, which makes it slow.

One existing unit test is changed from a stack-allocated to a heap-allocated array, because it was observed to use too much stack memory when running locally with the --large-memory option.

Fixes #2019.

@codecov

codecov bot commented May 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.25%. Comparing base (fe97ba4) to head (e8986f1).
Report is 14 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2085      +/-   ##
============================================
+ Coverage     71.08%   71.25%   +0.17%     
============================================
  Files           122      122              
  Lines         66210    66022     -188     
============================================
- Hits          47065    47044      -21     
+ Misses        19145    18978     -167     
Files with missing lines Coverage Δ
src/hashtable.c 81.37% <100.00%> (ø)

... and 10 files with indirect coverage changes


Member

@ranshid ranshid left a comment


Overall looks simple to me.
From what I get from this, this does not fix the issue though, right?

      size_t cursor = randomSizeT();
      while (samples.seen < count) {
    -     cursor = hashtableScan(ht, cursor, sampleEntriesScanFn, &samples);
    +     hashtableScan(ht, randomSizeT(), sampleEntriesScanFn, &samples);
Member


Do you think this will have performance impact on random commands?

Contributor Author


No, not really. Memory access is the slow operation here. PRNG is fast compared to that.

Member

@soloestoy soloestoy May 16, 2025


IIUC, using a random cursor breaks the SCAN method’s guarantee of uniqueness, risking duplicate key processing. For functions like evict, this could degrade performance or cause logical errors.

Contributor Author

@zuiderkwast zuiderkwast May 16, 2025


This is the code to get some random keys, hashtableSampleEntries, similar to dictGetSomeKeys. It uses the scan implementation at a random cursor because the scan implementation can already handle rehashing and skip indexes that have already been rehashed.

In SCAN we don't use a random cursor.

Eviction does use kvstoreHashtableSampleEntries so you're right, it can now return duplicate entries. It's the same in dictGetSomeKeys. This PR makes hashtable sampling more similar to dictGetSomeKeys.

By default maxmemory_samples is 5, so it basically only samples one or two buckets.

Contributor

@Fusl Fusl May 16, 2025


Eviction does use kvstoreHashtableSampleEntries so you're right, it can now return duplicate entries. It's the same in dictGetSomeKeys. This PR makes hashtable sampling more similar to dictGetSomeKeys.

It appears that's a more than acceptable trade-off, considering we're improving overall performance here (at the very least for sparse hash tables). A high keyspace churn rate (due to repeated memory evictions) can evidently also quickly cause the keyspace hash table to become skewed, so this is IMHO a net-positive trade-off we're making.

Running a test with `--maxmemory 100mb` + `--maxmemory-policy allkeys-random` and filling the primary kv hashtable with random string entries:
Before this change, the primary kv hash table chain length can easily extend to over 40. After this change, the chain length stays under 10 most of the time.

Red line is 8.1.1, blue line is 8.1.1 with the PR patch applied:


Contributor


I've already tried only adjusting FAIR_RANDOM_SAMPLE_SIZE downwards (20, 10, 5, and 1 times ENTRIES_PER_BUCKET) in my very early investigation when I opened the issue, and it only made matters significantly worse. I was easily able to reproduce the issue when adjusting that multiplier downwards in my testing (chain length would skyrocket very quickly), but I didn't think it was worth mentioning since it wouldn't show the actual issue on an unmodified code base.

I think the only reason we can adjust the FAIR_RANDOM_SAMPLE_SIZE multiplier down here is that so many samples are no longer necessary now that we randomize the cursor during sampling rather than iterating on the same cursor, so we won't as easily run into the same randomness bias issue.

Contributor

@Fusl Fusl May 19, 2025


In fact, I would even argue that with this entire patch applied, it may be worth considering getting rid of FAIR_RANDOM_SAMPLE_SIZE completely and only using WEAK_RANDOM_SAMPLE_SIZE (renamed to just RANDOM_SAMPLE_SIZE), since that also improved spop/srandmember performance by a good amount in my benchmarks without sacrificing too much randomness quality.

Contributor Author


I suspect the primary reason for the improvement might be adjusting the FAIR_RANDOM_SAMPLE_SIZE. What would be the outcome if we only modified FAIR_RANDOM_SAMPLE_SIZE while keeping the previous hashtableScan implementation unchanged?

@soloestoy Not at all. Only lowering the sample size makes the randomness worse without solving the issue.

I need to explain why this problem arises from the scan-order sampling:

  • The fair random algorithm samples some elements by picking a random cursor and then scanning from that point until it finds a number of elements and then picking a random one of them.
  • Now, assume there is an empty sequence (in scan order) of buckets.
  • When we pick a random cursor and it points into this empty sequence, we continue scanning until we find some elements and then pick one of these. Therefore, the buckets just after an empty sequence are more likely to be picked than other buckets.
  • If we repeat SPOP many times, this area just after the empty section will be emptied too and the empty section grows, the problem gets worse and worse.
    |x.xx.x.xx.x.........x.xx..x.x..x.xx..x.|
                  ^      ^ ^^  ^ ^
                  |      | ||  | |
                random   elements
                cursor    sampled

Making sampling not follow any scan order solves the issue. The bad algorithm is my mistake; I wrote it when I implemented the hashtable. dictGetFairRandomKey does just what this PR does: it picks a random bucket every time. I should have followed dict's design here, but I thought it was smart to sample in scan order. It wasn't.

We can keep the same sample size, but we have a unit test where we actually do some statistical measurement of the fairness. Without this change, that fairness test would fail with a smaller sample size. This change allows us to lower the sample size.

We could probably lower it more, but I didn't want to risk the fairness of it in cases like ongoing rehashing, very sparse tables, etc. Maybe I'm too conservative, but to lower it more, we should tune this more carefully. The unit test code is useful for tuning it.

Member


It's better to add comments to explain the history.

Contributor Author


I will improve the PR top comment.

@Fusl
Contributor

Fusl commented May 15, 2025

From what I get from this, this does not fix the issue though right?

I just added an update to that issue. It appears that this patch solves the issue we've been running into.

@zuiderkwast
Contributor Author

Overall looks simple to me. From what I get from this, this does not fix the issue though right?

In that comment, @Fusl was referring to another patch which was wrong and didn't fix the problem, this one: #2019 (comment)

Member

@enjoy-binbin enjoy-binbin left a comment


LGTM.

@soloestoy
Member

I have not yet carefully reviewed the hashtableScanDefrag function, but based on the implementation method of dictGetSomeKeys, I have two questions:

  1. Does hashtableScanDefrag quickly skip the rehashidx when encountering an ongoing rehash during the scan?
  2. For sparse hash tables, are there additional termination conditions (e.g., exiting the scan if a certain proportion of empty buckets are accessed), even if the target number of keys has not been reached?

@zuiderkwast
Contributor Author

I have not yet carefully reviewed the hashtableScanDefrag function, but based on the implementation method of dictGetSomeKeys, I have two questions:

  1. Does hashtableScanDefrag quickly skip the rehashidx when encountering an ongoing rehash during the scan?

Yes, it skips indexes in the source table (table 0) if they have already been rehashed.

  2. For sparse hash tables, are there additional termination conditions (e.g., exiting the scan if a certain proportion of empty buckets are accessed), even if the target number of keys has not been reached?

No, but it takes fewer samples when the table is sparse, so it is potentially less fair. This makes it faster, but it still loops until it finds an entry. This is similar to how dictGetFairRandomKey falls back to dictGetRandomKey when dictGetSomeKeys didn't return any elements: dictGetRandomKey is less fair, but it loops until it finds an entry.

Member

@ranshid ranshid left a comment


The change LGTM.

Contributor

@Fusl Fusl left a comment


Not an authoritative PR approval, but this LGTM overall. Just one note:

    void *samples[FAIR_RANDOM_SAMPLE_SIZE];
    unsigned count = hashtableSampleEntries(ht, &samples[0], FAIR_RANDOM_SAMPLE_SIZE);
    /* Sample less if it's very sparse. */
    size_t num_samples = hashtableSize(ht) >= hashtableBuckets(ht) ? FAIR_RANDOM_SAMPLE_SIZE : WEAK_RANDOM_SAMPLE_SIZE;
Contributor


This `hashtableSize(ht) >= hashtableBuckets(ht)` expression in my testing barely ever evaluates to false, and when it does, we're typically already (or about to be) in the process of rehashing the hashtable to a smaller size. It doesn't hurt to reduce the sampling rate during this time; I just figured it was worth a note in case this logic was intended to be different.

Contributor Author

@zuiderkwast zuiderkwast May 19, 2025


Yeah, I was conservative when I picked this threshold. I didn't want to sample too few elements and get unfair randomness in normal circumstances. This case kicks in at a fill factor of 14%. If the resize policy is set to AVOID, the table can be very sparse: MIN_FILL_PERCENT_HARD is just 3%.

@zuiderkwast zuiderkwast moved this to In Progress in Valkey 8.1 May 19, 2025
@zuiderkwast zuiderkwast merged commit 45d03e6 into valkey-io:unstable May 19, 2025
51 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to To be backported in Valkey 8.1 May 19, 2025
@zuiderkwast zuiderkwast deleted the random-sparse branch May 19, 2025 09:08
@zuiderkwast zuiderkwast added the performance and bug (Something isn't working) labels May 21, 2025
hpatro pushed a commit to hpatro/valkey that referenced this pull request Jun 4, 2025
hpatro pushed a commit to hpatro/valkey that referenced this pull request Jun 4, 2025
hpatro pushed a commit that referenced this pull request Jun 9, 2025
hpatro pushed a commit that referenced this pull request Jun 11, 2025
@hpatro hpatro moved this from To be backported to 8.1.2 in Valkey 8.1 Jun 11, 2025
shanwan1 pushed a commit to shanwan1/valkey that referenced this pull request Jun 13, 2025

Labels

bug (Something isn't working) · performance · release-notes (This issue should get a line item in the release notes)

Projects

Status: 8.1.2

Development

Successfully merging this pull request may close these issues.

[BUG] valkey-server slows down with spop <key> and srandmember <key> spinning in hashtableScanDefrag after a few hours

5 participants