
add resize-hashtables-enabled config to control tryResizeHashTables in serverCron() #8611

Open
Axlgrep wants to merge 1 commit into redis:unstable from Axlgrep:add_resize_hashtables_enabled_config

Conversation


@Axlgrep Axlgrep commented Mar 6, 2021

Recently we discovered a redis-cluster (128 shards) that often reported slow logs for simple commands such as UNLINK/SETEX. The keys/values are not big, and we were confused that such simple commands could cost tens or even hundreds of milliseconds.

after we looked at the monitoring, we found some interesting things:

  1. the number of keys in the redis-cluster fluctuates greatly throughout the day (a single node can reach up to 40 million keys, but the lowest is just 2.6 million), and the key-count trend is the same every day.
  2. the memory of a node shakes violently at certain times of the day (we don't have big-key deletes or writes).
  3. the times at which slowlog entries were reported each day were very similar.

The key-count trend is shown in the following figure:

Snipaste_2021-03-06_22-18-26

We realized that it might have something to do with creating the hash table at the beginning of a rehash: our redis-cluster node may need to allocate and free large chunks of memory, up to 512MB for the new hash table (40 million keys require a table of size 2^26 to store them, and 2^26 * sizeof(dictEntry*) = 512MB). So we added some logging in the code to record the time spent allocating/freeing memory:

ZCalloc log:
zcalloc_source

ZFree log:
zfree_source

Then we watched for a few days, and sure enough, the slowlog entries were caused by allocating/freeing memory:
Setex slowlog:
setex_slowlog

Unlink slowlog:
unlink_slowlog

The detailed time consumption over these days is shown in the following figures:

ZCalloc time cost:
zcalloc_cost

ZFree time cost:
zfree_cost
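As a sanity check on the arithmetic in the report, the 512MB figure is easy to reproduce with a short standalone sketch (an illustration only, not Redis code; Redis's dict grows its bucket array in powers of two):

```c
#include <stdint.h>

/* Redis's dict sizes its bucket array to the next power of two
 * that fits the key count. */
static uint64_t next_power_of_two(uint64_t n) {
    uint64_t size = 1;
    while (size < n) size <<= 1;
    return size;
}

/* The table itself is one pointer-sized bucket per slot
 * (sizeof(dictEntry*) == 8 on 64-bit builds), so 40 million keys
 * need 2^26 buckets * 8 bytes = 512MB. */
static uint64_t table_bytes(uint64_t keys) {
    return next_power_of_two(keys) * sizeof(void *);
}
```

This matches the 536870912-byte zfree mentioned later in the thread.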

So for this type of instance (where the key-count trend repeats every day), could we provide the ability to prevent tryResizeHashTables()? It leads to dictResize/dictExpand again and again every day, costs CPU time, and may block for too long on allocating/freeing large blocks of memory. Maybe users don't care about a little extra memory for hash tables, but performance is what they really care about.


oranagra commented Mar 7, 2021

@Axlgrep thanks for the detailed report.
I certainly agree that a user with such a usage pattern would prefer the memory cost over the latency spikes, but I'm not comfortable with the config (one that just prevents dicts from being shrunk).
I'd argue the same about a config that lets the user control HASHTABLE_MIN_FILL; it is too low level.
I would rather find some mechanism that can do that automatically in some way, without users having to detect the problem (which I assume most won't) and tune their database.

Looking at the code, I see we use calloc in that case, and I suspect that the latency spike is due to the zeroing of the memory, not the memory allocation itself.
The zeroing of memory brings latency due to two issues:

  1. obviously it takes time to zero memory (memory bandwidth).
  2. the kernel needs to allocate pages and assign them to the process (I think this causes the most damage).

so if we somehow do the zeroing gradually, we'll avoid the latency spike.

I would like to suggest modifying the code so that it does not use calloc; instead it can use malloc, and only zero the entries when writing to them for the first time.
Obviously, the code will need to keep a variable that tracks the maximum index that has already been written to, and avoid reading from entries that are still uninitialized.
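A minimal sketch of that idea, with hypothetical names (not proposed Redis code): allocate the bucket array with malloc, keep a watermark of the highest initialized index, and zero the gap lazily on first write.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical lazily-zeroed bucket array. Buckets below the
 * 'initialized' watermark are valid; buckets above it are treated
 * as empty without ever touching the uninitialized memory. */
typedef struct lazytable {
    void **buckets;     /* malloc'd, NOT zeroed up front */
    size_t size;        /* total number of buckets */
    size_t initialized; /* buckets [0, initialized) are zeroed */
} lazytable;

static int lazytable_init(lazytable *t, size_t size) {
    t->buckets = malloc(size * sizeof(void *));
    if (!t->buckets) return -1;
    t->size = size;
    t->initialized = 0;
    return 0;
}

/* Return the bucket head; uninitialized slots read as NULL. */
static void *lazytable_get(lazytable *t, size_t idx) {
    return (idx < t->initialized) ? t->buckets[idx] : NULL;
}

/* On first write past the watermark, zero only the gap up to the
 * written index, instead of all 512MB at once. */
static void lazytable_set(lazytable *t, size_t idx, void *val) {
    if (idx >= t->initialized) {
        memset(t->buckets + t->initialized, 0,
               (idx + 1 - t->initialized) * sizeof(void *));
        t->initialized = idx + 1;
    }
    t->buckets[idx] = val;
}
```

One caveat of this naive sketch: a single write far past the watermark still zeroes the whole gap, so with uniformly distributed hash indices a real implementation would likely need to bound the zeroed chunk per operation.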

While we're on this topic, I would like to raise another improvement idea I had; maybe we can combine these two and code them together:

Currently a rehash comes with a big memory price, which sometimes causes eviction and data loss.
This is because we allocate another hash table that's double the size, and keep the old one in parallel with the new one for a while.
My idea is that instead of allocating a new hash table, we'll realloc the existing one, and then start moving entries from the lower half to the upper half. Alternatively, when shrinking, we can maybe concentrate the entries in the lower part, and then eventually realloc the hash table to be smaller.
obviously this poses some challenges in dictScan and other areas of dict.c, but i think they're all solvable.
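As a rough illustration of the realloc-and-split idea (a toy sketch under simplifying assumptions: power-of-two table, chained entries, and none of the dictScan or incremental-rehash concerns mentioned above):

```c
#include <stdlib.h>
#include <string.h>

/* Toy entry with chaining, mirroring dictEntry's next pointer. */
typedef struct entry {
    unsigned long hash;
    struct entry *next;
} entry;

/* Grow the table in place: realloc to double the size, zero the new
 * upper half, then redistribute each chain. With a power-of-two table,
 * an entry either stays at index i or moves to i + oldsize, depending
 * on a single extra hash bit. */
static entry **grow_in_place(entry **table, unsigned long oldsize) {
    unsigned long newsize = oldsize * 2;
    entry **t = realloc(table, newsize * sizeof(entry *));
    if (!t) return NULL;
    memset(t + oldsize, 0, oldsize * sizeof(entry *));
    for (unsigned long i = 0; i < oldsize; i++) {
        entry *e = t[i], *lo = NULL, *hi = NULL;
        while (e) {
            entry *next = e->next;
            if (e->hash & oldsize) { e->next = hi; hi = e; }
            else                   { e->next = lo; lo = e; }
            e = next;
        }
        t[i] = lo;
        t[i + oldsize] = hi;
    }
    return t;
}
```

The split test `hash & oldsize` works because doubling a power-of-two table exposes exactly one new bit of the hash; this is the same trick Java's HashMap uses when it grows.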

@Axlgrep would you like to look into any of these?


Axlgrep commented Mar 7, 2021

@oranagra

To test our guess that the latency spike is caused by the zeroing of the memory, I can modify the source code (wherever zcalloc is used to allocate memory, also use zmalloc to allocate another block of the same size), and then over the next few days compare the latency spikes of ZCalloc() and ZMalloc().

However, it should be noted that there is also a risk of a latency spike when freeing large chunks of memory (a zfree of 536870912 bytes cost 133389 us).

@zuiderkwast
Contributor

While we're at it: there are other implementations that are both faster and smaller in memory than the Redis dict, for example "SIMD-friendly bucketized cuckoo hash" (with more explanation found in this discussion) and HAMT. (The former can be rehashed in place and can be 99% full while still being O(1); the latter doesn't need rehashing at all and doesn't use huge allocations.)


oranagra commented Mar 8, 2021

@zuiderkwast can these support our SCAN algorithm?
p.s. we have some long term plan to move away from dict and use rax instead, but this is still far ahead; for now I'd rather make small incremental changes.

@zuiderkwast
Contributor

@oranagra

can these support our SCAN algorithm?

I have to think about that. :-) I suppose it would have to be adapted. Does it work for rax?

p.s. we have some long term plan to move away from dict and use rax instead

Yeah, so I've read, but also that large random-like keys, e.g. UUIDs, take up more space in rax. A HAMT is essentially a trie over the hash value of a key, with a high branching factor, so it's somewhere between a rax and a hash table. (If we don't need it to be ordered, I imagine HAMT is better.)


oranagra commented Mar 8, 2021

rax can't support SCAN; if/when we do that, we'll need to break compatibility with that command.


zuiderkwast commented Mar 9, 2021

I think rax can support SCAN in this way: the key space is mapped to the cursor's integer range by using the first 4 or 8 bytes of the key (treated as big-endian) as the cursor value. The cursor 0 corresponds to keys starting with "\0\0\0\0"...

It implies that all keys with the same 4 or 8 bytes prefix have to be included in the same scan "chunk", so if all (or many) of the keys have the same prefix, SCAN has to return them all to be able to increment the cursor, and it falls back to KEYS behaviour in the worst case.

For the HAMT (which is a trie of the hash values of keys), the cursor can simply be the hash value of a key. The keys are traversed in the order of their hash values.
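A minimal sketch of the cursor mapping described above (a hypothetical helper, assuming an 8-byte cursor; short keys are zero-padded):

```c
#include <stdint.h>
#include <stddef.h>

/* Treat the first 8 bytes of the key as a big-endian integer.
 * Iterating keys in rax (lexicographic) order then yields
 * non-decreasing cursor values. */
static uint64_t key_to_cursor(const unsigned char *key, size_t len) {
    uint64_t cursor = 0;
    for (size_t i = 0; i < 8; i++) {
        cursor <<= 8;
        if (i < len) cursor |= key[i];
    }
    return cursor;
}
```

This also makes the common-prefix problem visible: all keys sharing the same first 8 bytes collapse onto one cursor value, so they must be returned in a single chunk.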


oranagra commented Mar 9, 2021

I'm not sure what you describe about common prefixes for the rax is acceptable.
For the two approaches, we have an issue with the special meaning of a cursor value of 0.

@zuiderkwast
Contributor

I'm not sure what you describe about common prefixes for the rax is acceptable.

Well, if there are a lot of keys like "user123456", they would all need to be returned together, so it becomes like KEYS, which is not acceptable. (Only perhaps in combination with deprecating SCAN.)

For the two approaches, we have an issue with the special meaning of a cursor value of 0.

Well, 0 (as input) means start from the first key. The cursor has to always increment so we'll never return 0 again until all keys are scanned, so that's not an issue.

@Axlgrep Sorry for hijacking this PR! I suppose this side-track is done.


lsn1994 commented Jun 9, 2021

Hi @Axlgrep, I have a great interest in the memory allocation of dict rehash and have also done some research. I'd like to confirm: did you use jemalloc when you encountered the issue mentioned above?
Thanks in advance!


zuiderkwast commented Sep 23, 2021

Idea: would it be possible to do the large hash-table malloc and free asynchronously in another thread, so that the main thread can continue serving clients? If I understand it right, there is a special arena shared among threads for large allocations (opt.oversize_threshold, 8MiB by default). The main thread would not start incremental rehashing until the allocation thread is done allocating.
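A rough sketch of that idea using a plain pthread worker (hypothetical names; in Redis this would presumably go through the existing bio.c background-job machinery rather than spawning a thread per allocation):

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical background-allocation job: the worker performs the
 * slow calloc (including the page zeroing) off the main thread. */
typedef struct alloc_job {
    size_t size;
    void *result;
    int done;             /* guarded by mutex */
    pthread_mutex_t mutex;
} alloc_job;

static void *alloc_worker(void *arg) {
    alloc_job *job = arg;
    void *p = calloc(1, job->size);  /* slow zeroing happens here */
    pthread_mutex_lock(&job->mutex);
    job->result = p;
    job->done = 1;
    pthread_mutex_unlock(&job->mutex);
    return NULL;
}

/* Main thread: non-blocking readiness check, e.g. polled once per
 * serverCron tick before starting incremental rehashing. */
static int alloc_job_ready(alloc_job *job, void **out) {
    int ready;
    pthread_mutex_lock(&job->mutex);
    ready = job->done;
    if (ready) *out = job->result;
    pthread_mutex_unlock(&job->mutex);
    return ready;
}
```

The same pattern would cover the free side: Redis already offloads large frees this way via lazyfree.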

[Edit] @oranagra, OK, I've moved this comment to the "perfect implementation of hashtable" issue where this thread has been running lately. (Hard to decide where it best fits: that issue is mainly about the nested hash table idea, while this PR is about the slow malloc and free, so I thought here might be a good place for the comment. The other place is where the "thread" is now, though.)

@oranagra
Member

@zuiderkwast did you mean to comment in the other issue? If you did, maybe move your comment and I'll respond there.


