
Fix random element in skewed sparse hash table #2085

Merged
zuiderkwast merged 1 commit into valkey-io:unstable from zuiderkwast:random-sparse on May 19, 2025

Conversation

@zuiderkwast
Contributor

@zuiderkwast zuiderkwast commented May 14, 2025

In cases when half of the elements in a hashtable have been deleted in scan order, i.e. by scanning and deleting the returned elements, the random selection algorithm becomes very slow. Also, if a random element is deleted repeatedly, this can cause the hash table to become skewed and make random selection very slow. Behavior without this fix:

  • The fair random algorithm samples elements by picking a random cursor and scanning from that point until it has found a number of elements, then picking a random one of them.
  • Now, assume there is an empty sequence of buckets (in scan order).
  • When we pick a random cursor and it points into this empty sequence, we continue scanning until we find some elements and then pick one of these. Therefore, the buckets just after an empty sequence are more likely to be picked than other buckets.
  • If we repeat SPOP many times, the area just after the empty section is emptied too, the empty section grows, and the problem gets worse and worse.

Bucket distribution (x = non-empty bucket, . = empty bucket)

x..xx.x.xx.x.........x.xx..x.x..x.xx..x.x
              ^      ^ ^^  ^ ^
              |      | ||  | |
            random   elements
            cursor    sampled

This PR changes the sampling to pick a new random index to scan for each new bucket chain that is sampled. This is more similar to how fair random selection works in dict.c.

Additionally, this PR includes two small optimizations/tuning:

  1. The change allows us to lower the sample sizes for fair random element selection. There are unit tests to verify the fairness of the algorithm.
  2. The sample size is decreased even further in very sparse tables, to prevent the algorithm from sampling empty buckets many times, which makes it slow.

One existing unit test is changed from a stack-allocated to a heap-allocated array, because it was observed to use too much stack memory when running locally with the --large-memory option.

Fixes #2019.

@codecov

codecov bot commented May 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.25%. Comparing base (fe97ba4) to head (e8986f1).
Report is 14 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2085      +/-   ##
============================================
+ Coverage     71.08%   71.25%   +0.17%     
============================================
  Files           122      122              
  Lines         66210    66022     -188     
============================================
- Hits          47065    47044      -21     
+ Misses        19145    18978     -167     
Files with missing lines Coverage Δ
src/hashtable.c 81.37% <100.00%> (ø)

... and 10 files with indirect coverage changes


Member

@ranshid ranshid left a comment


Overall looks simple to me.
From what I get from this, this does not fix the issue though, right?

      size_t cursor = randomSizeT();
      while (samples.seen < count) {
    -     cursor = hashtableScan(ht, cursor, sampleEntriesScanFn, &samples);
    +     hashtableScan(ht, randomSizeT(), sampleEntriesScanFn, &samples);
Member


Do you think this will have performance impact on random commands?

Contributor Author


No, not really. Memory access is the slow operation here. PRNG is fast compared to that.

Member

@soloestoy soloestoy May 16, 2025


IIUC, using a random cursor breaks the SCAN method’s guarantee of uniqueness, risking duplicate key processing. For functions like evict, this could degrade performance or cause logical errors.

Contributor Author

@zuiderkwast zuiderkwast May 16, 2025


This is the code to get some random keys, hashtableSampleEntries, similar to dictGetSomeKeys. It uses the scan implementation at a random cursor because the scan implementation can already handle rehashing and skip indexes that have already been rehashed.

In SCAN we don't use a random cursor.

Eviction does use kvstoreHashtableSampleEntries so you're right, it can now return duplicate entries. It's the same in dictGetSomeKeys. This PR makes hashtable sampling more similar to dictGetSomeKeys.

By default maxmemory_samples is 5, so it basically only samples one or two buckets.

Contributor

@Fusl Fusl May 16, 2025


Eviction does use kvstoreHashtableSampleEntries so you're right, it can now return duplicate entries. It's the same in dictGetSomeKeys. This PR makes hashtable sampling more similar to dictGetSomeKeys.

It appears that's a more than acceptable trade-off, considering we're improving overall performance here (at the very least for sparse hash tables). A high keyspace churn rate (due to repeated memory evictions) can evidently also quickly cause the keyspace hash table to become skewed, so this is IMHO a net-positive trade-off we're making.

Running a test with `--maxmemory 100mb` + `--maxmemory-policy allkeys-random` and filling the primary kv hashtable with random string entries:
Before this change, the primary kv hash table chain length can easily extend to over 40. After this change, the chain length stays under 10 most of the time.

Red line is 8.1.1, blue line is 8.1.1 with the PR patch applied:


Contributor


I've already tried only adjusting FAIR_RANDOM_SAMPLE_SIZE downwards (20, 10, 5, and 1 times ENTRIES_PER_BUCKET) in my very early investigation when I opened the issue, and it only made matters significantly worse. I was easily able to reproduce the issue when adjusting that multiplier downwards in my testing (chain length would skyrocket very quickly), but I didn't think it was worth mentioning since it wouldn't show the actual issue on an unmodified code base.

I think the only reason we can adjust the FAIR_RANDOM_SAMPLE_SIZE multiplier down here is that so many samples are no longer necessary now that we randomize the cursor during sampling rather than iterating on the same cursor, so we won't as easily run into the same randomness bias issue.

Contributor

@Fusl Fusl May 19, 2025


In fact, I would even argue that with this entire patch applied, it may be worth considering getting rid of FAIR_RANDOM_SAMPLE_SIZE completely and only using WEAK_RANDOM_SAMPLE_SIZE (renamed to just RANDOM_SAMPLE_SIZE), since that also improved spop/srandmember performance by a good amount in my benchmarks without sacrificing too much randomness quality.

Contributor Author


I suspect the primary reason for the improvement might be adjusting the FAIR_RANDOM_SAMPLE_SIZE. What would be the outcome if we only modified FAIR_RANDOM_SAMPLE_SIZE while keeping the previous hashtableScan implementation unchanged?

@soloestoy Not at all. Only lowering the sample size makes the randomness worse without solving the issue.

I need to explain why this problem arises from the scan-order sampling:

  • The fair random algorithm samples some elements by picking a random cursor and then scanning from that point until it finds a number of elements and then picking a random one of them.
  • Now, assume there is an empty sequence (in scan order) of buckets.
  • When we pick a random cursor and it points into this empty sequence, we continue scanning until we find some elements and then pick one of these. Therefore, the buckets just after an empty sequence are more likely to be picked than other buckets.
  • If we repeat SPOP many times, this area just after the empty section will be emptied too and the empty section grows, the problem gets worse and worse.
    |x.xx.x.xx.x.........x.xx..x.x..x.xx..x.|
                  ^      ^ ^^  ^ ^
                  |      | ||  | |
                random   elements
                cursor    sampled

Making sampling not follow any scan order solves the issue. The bad algorithm is my mistake; I wrote it when I implemented the hashtable. dictGetFairRandomKey does just what this PR does: it picks a random bucket every time. I should have followed dict's design here, but I thought it was smart to sample in scan order. It wasn't.

We can keep the same sample size, but we have a unit test where we actually do some statistical measurement of the fairness. Without this change, that fairness test would fail with a smaller sample size. This change allows us to lower the sample size.

We could probably lower it more, but I didn't want to risk the fairness of it in cases like ongoing rehashing, very sparse tables, etc. Maybe I'm too conservative, but to lower it more, we should tune this more carefully. The unit test code is useful for tuning it.

Member


It's better to add comments to explain the history.

Contributor Author


I will improve the PR top comment.

@Fusl
Contributor

Fusl commented May 15, 2025

From what I get from this, this does not fix the issue though right?

I just added an update to that issue. It appears that this patch solves the issue we've been running into.

@zuiderkwast
Contributor Author

Overall looks simple to me. From what I get from this, this does not fix the issue though right?

In that comment, @Fusl was referring to another patch which was wrong and didn't fix the problem, this one: #2019 (comment)

Member

@enjoy-binbin enjoy-binbin left a comment


LGTM.

@soloestoy
Member

I have not yet carefully reviewed the hashtableScanDefrag function, but based on the implementation method of dictGetSomeKeys, I have two questions:

  1. Does hashtableScanDefrag quickly skip the rehashidx when encountering an ongoing rehash during the scan?
  2. For sparse hash tables, are there additional termination conditions (e.g., exiting the scan if a certain proportion of empty buckets are accessed), even if the target number of keys has not been reached?

@zuiderkwast
Contributor Author

I have not yet carefully reviewed the hashtableScanDefrag function, but based on the implementation method of dictGetSomeKeys, I have two questions:

  1. Does hashtableScanDefrag quickly skip the rehashidx when encountering an ongoing rehash during the scan?

Yes, it skips indexes in the source table (table 0) if they have already been rehashed.

  2. For sparse hash tables, are there additional termination conditions (e.g., exiting the scan if a certain proportion of empty buckets are accessed), even if the target number of keys has not been reached?

No, but it takes fewer samples when the table is sparse, so it is potentially less fair. This makes it faster, but it still loops until it finds an entry. This is similar to how dictGetFairRandomKey falls back to dictGetRandomKey when dictGetSomeKeys didn't return any elements: dictGetRandomKey is less fair, but it loops until it finds an entry.

Member

@ranshid ranshid left a comment


The change LGTM.

Contributor

@Fusl Fusl left a comment


Not an authoritative PR approval, but this LGTM overall. Just one note:

    void *samples[FAIR_RANDOM_SAMPLE_SIZE];
    unsigned count = hashtableSampleEntries(ht, &samples[0], FAIR_RANDOM_SAMPLE_SIZE);
    /* Sample less if it's very sparse. */
    size_t num_samples = hashtableSize(ht) >= hashtableBuckets(ht) ? FAIR_RANDOM_SAMPLE_SIZE : WEAK_RANDOM_SAMPLE_SIZE;
Contributor


This `hashtableSize(ht) >= hashtableBuckets(ht)` expression in my testing barely ever evaluates to false, and when it does, we're typically already (or about to be) in the process of rehashing the hashtable to a smaller size. It doesn't hurt to reduce the sampling rate during this time; I just figured it was worth a note in case this logic was intended to be different.

Contributor Author

@zuiderkwast zuiderkwast May 19, 2025


Yeah, I was conservative when I picked this threshold. I didn't want to sample too few elements and get unfair randomness in normal circumstances. This case kicks in at a fill factor of 14%. If the resize policy is set to AVOID, the table can be very sparse: MIN_FILL_PERCENT_HARD is just 3%.

@zuiderkwast zuiderkwast moved this to In Progress in Valkey 8.1 May 19, 2025
@zuiderkwast zuiderkwast merged commit 45d03e6 into valkey-io:unstable May 19, 2025
51 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to To be backported in Valkey 8.1 May 19, 2025
@zuiderkwast zuiderkwast deleted the random-sparse branch May 19, 2025 09:08
@zuiderkwast zuiderkwast added the performance and bug (Something isn't working) labels May 21, 2025
hpatro pushed a commit to hpatro/valkey that referenced this pull request Jun 4, 2025
hpatro pushed a commit to hpatro/valkey that referenced this pull request Jun 4, 2025
hpatro pushed a commit that referenced this pull request Jun 9, 2025
hpatro pushed a commit that referenced this pull request Jun 11, 2025
@hpatro hpatro moved this from To be backported to 8.1.2 in Valkey 8.1 Jun 11, 2025
shanwan1 pushed a commit to shanwan1/valkey that referenced this pull request Jun 13, 2025

Labels

bug (Something isn't working) · performance · release-notes (This issue should get a line item in the release notes)

Projects

Status: 8.1.2

Development

Successfully merging this pull request may close these issues.

[BUG] valkey-server slows down with spop <key> and srandmember <key> spinning in hashtableScanDefrag after a few hours

5 participants