
add resize-hashtables-enabled config to control tryResizeHashTables in serverCron() #8611

Open
Axlgrep wants to merge 1 commit into redis:unstable from Axlgrep:add_resize_hashtables_enabled_config

Conversation


@Axlgrep Axlgrep commented Mar 6, 2021

Recently we discovered a redis-cluster (128 shards) that often reported slow logs for simple commands such as UNLINK/SETEX. The keys/values are not big, and we were confused that such simple commands could cost tens or even hundreds of milliseconds.

after we looked at the monitoring, we found some interesting things:

  1. the number of keys in the redis-cluster fluctuates greatly throughout the day (a single node can reach up to 40 million keys, but the lowest is just 2.6 million), and the key-count trend is the same every day.
  2. the memory of a node shakes violently at certain times of the day (we don't have big-key deletes or writes).
  3. the times at which slowlog entries were reported each day were very similar.

The key-count trend is shown in the following figure:

Snipaste_2021-03-06_22-18-26

We realized that it might have something to do with creating the hash table at the beginning of a rehash: our redis-cluster node may need to allocate and free large chunks of memory, up to 512MB for the new hash table (40 million keys require a table of size 2^26 to store them, and 2^26 * sizeof(dictEntry*) = 512MB). So we added some logging in the code to record the time spent allocating/freeing memory:

ZCalloc log:
zcalloc_source

ZFree log:
zfree_source

Then we watched for a few days, and sure enough, the slowlog entries were caused by allocating/freeing memory:
Setex slowlog:
setex_slowlog

Unlink slowlog:
unlink_slowlog

The detailed time consumption over these days is shown in the following figures:

ZCalloc time cost:
zcalloc_cost

ZFree time cost:
zfree_cost
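As a sanity check on the arithmetic in the report, the 512MB figure is easy to reproduce with a short standalone sketch (an illustration only, not Redis code; Redis's dict grows its bucket array in powers of two):

```c
#include <stdint.h>

/* Redis's dict sizes its bucket array to the next power of two
 * that fits the key count. */
static uint64_t next_power_of_two(uint64_t n) {
    uint64_t size = 1;
    while (size < n) size <<= 1;
    return size;
}

/* The table itself is one pointer-sized bucket per slot
 * (sizeof(dictEntry*) == 8 on 64-bit builds), so 40 million keys
 * need 2^26 buckets * 8 bytes = 512MB. */
static uint64_t table_bytes(uint64_t keys) {
    return next_power_of_two(keys) * sizeof(void *);
}
```

This matches the 536870912-byte zfree mentioned later in the thread.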

So for this type of instance (where the key-count trend repeats every day), could we provide the ability to prevent tryResizeHashTables()? It leads to dictResize/dictExpand again and again every day, costs CPU time, and may block for too long on allocating/freeing large blocks of memory. Maybe users don't care about a little extra memory for hash tables, but performance is what they really care about.


oranagra commented Mar 7, 2021

@Axlgrep thanks for the detailed report.
I certainly agree that a user with such a usage pattern would prefer the memory cost over the latency spikes, but I'm not comfortable with the config (one that just prevents dicts from being shrunk).
I'd argue the same about a config that lets the user control HASHTABLE_MIN_FILL; it is too low level.
I would rather find some mechanism that can do that automatically in some way, without users having to detect the problem (which I assume most won't) and tune their database.

Looking at the code, I see we use calloc in that case, and I suspect that the latency spike is due to the zeroing of the memory, not the memory allocation itself.
The zeroing of memory brings latency due to two issues:

  1. obviously it takes time to zero memory (memory bandwidth).
  2. the kernel needs to allocate pages and assign them to the process (I think this causes the most damage).

so if we somehow do the zeroing gradually, we'll avoid the latency spike.

I would like to suggest modifying the code so that it does not use calloc; instead it can use malloc, and only zero the entries when writing to them for the first time.
Obviously, the code will need to keep a variable that tracks the maximum index that has already been written to, and avoid reading from entries that are still uninitialized.
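A minimal sketch of that idea, with hypothetical names (not proposed Redis code): allocate the bucket array with malloc, keep a watermark of the highest initialized index, and zero the gap lazily on first write.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical lazily-zeroed bucket array. Buckets below the
 * 'initialized' watermark are valid; buckets above it are treated
 * as empty without ever touching the uninitialized memory. */
typedef struct lazytable {
    void **buckets;     /* malloc'd, NOT zeroed up front */
    size_t size;        /* total number of buckets */
    size_t initialized; /* buckets [0, initialized) are zeroed */
} lazytable;

static int lazytable_init(lazytable *t, size_t size) {
    t->buckets = malloc(size * sizeof(void *));
    if (!t->buckets) return -1;
    t->size = size;
    t->initialized = 0;
    return 0;
}

/* Return the bucket head; uninitialized slots read as NULL. */
static void *lazytable_get(lazytable *t, size_t idx) {
    return (idx < t->initialized) ? t->buckets[idx] : NULL;
}

/* On first write past the watermark, zero only the gap up to the
 * written index, instead of all 512MB at once. */
static void lazytable_set(lazytable *t, size_t idx, void *val) {
    if (idx >= t->initialized) {
        memset(t->buckets + t->initialized, 0,
               (idx + 1 - t->initialized) * sizeof(void *));
        t->initialized = idx + 1;
    }
    t->buckets[idx] = val;
}
```

One caveat of this naive sketch: a single write far past the watermark still zeroes the whole gap, so with uniformly distributed hash indices a real implementation would likely need to bound the zeroed chunk per operation.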

While we're on this topic, I would like to raise another improvement idea I had; maybe we can combine these two and code them together:

Currently a rehash comes with a big memory price, which sometimes causes eviction and data loss.
This is because we allocate another hash table that's double the size, and keep the old one in parallel with the new one for a while.
My idea is that instead of allocating a new hash table, we'll realloc the existing one, and then start moving entries from the lower half to the upper half. Alternatively, when shrinking, we can maybe concentrate the entries in the lower part, and then eventually realloc the hash table to be smaller.
obviously this poses some challenges in dictScan and other areas of dict.c, but i think they're all solvable.
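As a rough illustration of the realloc-and-split idea (a toy sketch under simplifying assumptions: power-of-two table, chained entries, and none of the dictScan or incremental-rehash concerns mentioned above):

```c
#include <stdlib.h>
#include <string.h>

/* Toy entry with chaining, mirroring dictEntry's next pointer. */
typedef struct entry {
    unsigned long hash;
    struct entry *next;
} entry;

/* Grow the table in place: realloc to double the size, zero the new
 * upper half, then redistribute each chain. With a power-of-two table,
 * an entry either stays at index i or moves to i + oldsize, depending
 * on a single extra hash bit. */
static entry **grow_in_place(entry **table, unsigned long oldsize) {
    unsigned long newsize = oldsize * 2;
    entry **t = realloc(table, newsize * sizeof(entry *));
    if (!t) return NULL;
    memset(t + oldsize, 0, oldsize * sizeof(entry *));
    for (unsigned long i = 0; i < oldsize; i++) {
        entry *e = t[i], *lo = NULL, *hi = NULL;
        while (e) {
            entry *next = e->next;
            if (e->hash & oldsize) { e->next = hi; hi = e; }
            else                   { e->next = lo; lo = e; }
            e = next;
        }
        t[i] = lo;
        t[i + oldsize] = hi;
    }
    return t;
}
```

The split test `hash & oldsize` works because doubling a power-of-two table exposes exactly one new bit of the hash; this is the same trick Java's HashMap uses when it grows.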

@Axlgrep would you like to look into any of these?


Axlgrep commented Mar 7, 2021

@oranagra

To test our guess that the latency spike is caused by the zeroing of the memory, I can modify the source code (wherever zcalloc is used to allocate memory, also use zmalloc to allocate another block of the same size), and then over the next few days compare the latency spikes of ZCalloc() and ZMalloc().

However, it should be noted that there is also a risk of a latency spike when freeing large chunks of memory (a zfree of 536870912 bytes cost 133389 us).

@zuiderkwast
Contributor

While we're at it: there are other implementations that are both faster and smaller in memory than the Redis dict, for example "SIMD-friendly bucketized cuckoo hash" (with more explanation found in this discussion) and HAMT. (The former can be rehashed in place and can be 99% full while still being O(1); the latter doesn't need rehashing at all and doesn't use huge allocations.)


oranagra commented Mar 8, 2021

@zuiderkwast can these support our SCAN algorithm?
p.s. we have some long term plan to move away from dict and use rax instead, but this is still far ahead; for now I'd rather make small incremental changes.

@zuiderkwast
Contributor

@oranagra

can these support our SCAN algorithm?

I have to think about that. :-) I suppose it would have to be adapted. Does it work for rax?

p.s. we have some long term plan to move away from dict and use rax instead

Yeah, so I've read, but also that large random-like keys, e.g. UUIDs, take up more space in rax. A HAMT is essentially a trie over the hash value of a key, with a high branching factor, so it's somewhere between a rax and a hash table. (If we don't need it to be ordered, I imagine HAMT is better.)


oranagra commented Mar 8, 2021

rax can't support SCAN; if/when we do that, we'll need to break compatibility with that command.


zuiderkwast commented Mar 9, 2021

I think rax can support SCAN in this way: the key space is mapped to the cursor's integer range by using the first 4 or 8 bytes of the key (treated as big-endian) as the cursor value. The cursor 0 corresponds to keys starting with "\0\0\0\0"...

It implies that all keys with the same 4 or 8 bytes prefix have to be included in the same scan "chunk", so if all (or many) of the keys have the same prefix, SCAN has to return them all to be able to increment the cursor, and it falls back to KEYS behaviour in the worst case.

For the HAMT (which is a trie of the hash values of keys), the cursor can simply be the hash value of a key. The keys are traversed in the order of their hash values.
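A minimal sketch of the cursor mapping described above (a hypothetical helper, assuming an 8-byte cursor; short keys are zero-padded):

```c
#include <stdint.h>
#include <stddef.h>

/* Treat the first 8 bytes of the key as a big-endian integer.
 * Iterating keys in rax (lexicographic) order then yields
 * non-decreasing cursor values. */
static uint64_t key_to_cursor(const unsigned char *key, size_t len) {
    uint64_t cursor = 0;
    for (size_t i = 0; i < 8; i++) {
        cursor <<= 8;
        if (i < len) cursor |= key[i];
    }
    return cursor;
}
```

This also makes the common-prefix problem visible: all keys sharing the same first 8 bytes collapse onto one cursor value, so they must be returned in a single chunk.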


oranagra commented Mar 9, 2021

I'm not sure what you describe about common prefixes for the rax is acceptable.
For the two approaches, we have an issue with the special meaning of a cursor value of 0.

@zuiderkwast
Contributor

I'm not sure what you describe about common prefixes for the rax is acceptable.

Well, if there are a lot of keys like "user123456", they would all need to be returned together, so it becomes like KEYS, which is not acceptable. (Only perhaps in combination with deprecating SCAN.)

For the two approaches, we have an issue with the special meaning of a cursor value of 0.

Well, 0 (as input) means start from the first key. The cursor has to always increment so we'll never return 0 again until all keys are scanned, so that's not an issue.

@Axlgrep Sorry for hijacking this PR! I suppose this side-track is done.


lsn1994 commented Jun 9, 2021

Hi @Axlgrep, I have a great interest in the memory allocation of dict rehash and have also done some research. I'd like to confirm: did you use jemalloc when you encountered the issue mentioned above?
Thanks in advance!


zuiderkwast commented Sep 23, 2021

Idea: would it be possible to do the large hash-table malloc and free asynchronously in another thread, so that the main thread can continue serving clients? If I understand it right, there is a special arena shared among threads for large allocations (opt.oversize_threshold, 8MiB by default). The main thread would not start incremental rehashing until the allocation thread is done allocating.
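A rough sketch of that idea using a plain pthread worker (hypothetical names; in Redis this would presumably go through the existing bio.c background-job machinery rather than spawning a thread per allocation):

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical background-allocation job: the worker performs the
 * slow calloc (including the page zeroing) off the main thread. */
typedef struct alloc_job {
    size_t size;
    void *result;
    int done;             /* guarded by mutex */
    pthread_mutex_t mutex;
} alloc_job;

static void *alloc_worker(void *arg) {
    alloc_job *job = arg;
    void *p = calloc(1, job->size);  /* slow zeroing happens here */
    pthread_mutex_lock(&job->mutex);
    job->result = p;
    job->done = 1;
    pthread_mutex_unlock(&job->mutex);
    return NULL;
}

/* Main thread: non-blocking readiness check, e.g. polled once per
 * serverCron tick before starting incremental rehashing. */
static int alloc_job_ready(alloc_job *job, void **out) {
    int ready;
    pthread_mutex_lock(&job->mutex);
    ready = job->done;
    if (ready) *out = job->result;
    pthread_mutex_unlock(&job->mutex);
    return ready;
}
```

The same pattern would cover the free side: Redis already offloads large frees this way via lazyfree.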

[Edit] @oranagra, OK, I've moved this comment to the "perfect implementation of hashtable" issue where this thread has been running lately. (Hard to decide where it best fits: that issue is mainly about the nested hash table idea, while this PR is about the slow malloc and free, so I thought here might be a good place for the comment. The other place is where the "thread" is now, though.)

@oranagra
Member

@zuiderkwast did you mean to comment in the other issue? If you did, maybe move your comment and I'll respond there.


