add resize-hashtables-enabled config to control tryResizeHashTables in serverCron()#8611
Axlgrep wants to merge 1 commit into redis:unstable from
Conversation
@Axlgrep thanks for the detailed report. Looking at the code, I see we use calloc to allocate the new hash table, so if we somehow do the zeroing gradually, we'll avoid the latency spike. I would suggest modifying the code so that it does not use calloc; instead it can use malloc, and only zero the entries when writing to them for the first time. While at this topic, I would like to raise another improvement idea I had; maybe we can combine these two and code them together: currently a rehash comes with a big price in memory, which sometimes causes eviction and data loss. @Axlgrep would you like to look into either of these?
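The lazy-zeroing idea could look roughly like this (a hypothetical sketch, not Redis code; `lazy_table`, `lt_read`, and `lt_set` are made-up names): allocate the table with plain malloc, keep a watermark of how many buckets have been zeroed, treat buckets above the watermark as empty on reads, and zero everything up to the target bucket on writes. The cost of zeroing is then spread across writes instead of being paid up front in one big calloc.

```c
#include <stdlib.h>

/* Hypothetical sketch of lazy zeroing for a rehash target table.
 * Buckets at index >= zeroed have never been written, so reads there
 * can safely report "empty" without touching the uninitialized memory. */
typedef struct {
    void **buckets;
    size_t size;
    size_t zeroed;              /* buckets[0..zeroed-1] have been zeroed */
} lazy_table;

lazy_table *lt_create(size_t size) {
    lazy_table *t = malloc(sizeof(*t));
    t->buckets = malloc(size * sizeof(void *));   /* malloc, not calloc */
    t->size = size;
    t->zeroed = 0;
    return t;
}

void *lt_read(lazy_table *t, size_t idx) {
    if (idx >= t->zeroed) return NULL;  /* never written: logically empty */
    return t->buckets[idx];
}

void lt_set(lazy_table *t, size_t idx, void *val) {
    /* Zero up to the target bucket first, advancing the watermark, so
     * later reads below the watermark see real (possibly NULL) entries. */
    while (t->zeroed <= idx) t->buckets[t->zeroed++] = NULL;
    t->buckets[idx] = val;
}
```

A single write far into the table still zeroes a large range at once, so a real implementation would probably also cap the work per step, the way incremental rehashing already does.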
Since our guess is that the zeroing of the memory caused the latency spike, I can modify the source code (wherever zcalloc is used to allocate memory, also use zmalloc to allocate another block of the same size), then over the next few days compare the latency spikes of ZCalloc() and ZMalloc(). However, note that there is also a risk of a latency spike when freeing large chunks of memory (a zfree of 536870912 bytes cost 133389 us).
While we're at it: there are other implementations that are both faster and smaller in memory than the Redis dict, for example the "SIMD-friendly bucketized cuckoo hash" (with more explanation found in this discussion) and HAMT. (The former can be rehashed in place and can be 99% full while still being O(1); the latter doesn't need rehashing at all and doesn't use huge allocations.)
@zuiderkwast can these support our SCAN algorithm?
I have to think about that. :-) I suppose it would have to be adapted. Does it work for rax?
Yeah, so I've read, but also that large random-like keys (e.g. UUIDs) take up more space in a rax. A HAMT is essentially a trie over the hash value of a key, with a high branching factor, so it's somewhere between a rax and a hash table. (If we don't need it to be ordered, I imagine HAMT is better.)
A rax can't support SCAN; if/when we do that, we'll need to break compatibility with that command.
I think rax can support SCAN in this way: the key space is mapped to the cursor's integer range by using the first 4 or 8 bytes of the key (treated as big-endian) as the cursor value. The cursor 0 corresponds to keys starting with "\0\0\0\0"... It implies that all keys with the same 4- or 8-byte prefix have to be included in the same scan "chunk", so if all (or many) of the keys have the same prefix, SCAN has to return them all to be able to increment the cursor, and it falls back to KEYS behaviour in the worst case. For the HAMT (which is a trie of the hash values of keys), the cursor can simply be the hash value of a key; the keys are traversed in the order of their hash values.
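The cursor mapping described here could be sketched as follows (a hypothetical illustration; `key_to_cursor` is a made-up name): the first 8 bytes of the key, read big-endian and zero-padded for shorter keys, become the 64-bit cursor, so all keys sharing an 8-byte prefix collapse to the same cursor value and must be returned in the same chunk.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: derive a SCAN cursor from a rax key by reading
 * its first 8 bytes as a big-endian integer. Keys shorter than 8 bytes
 * are implicitly zero-padded on the right. */
uint64_t key_to_cursor(const unsigned char *key, size_t len) {
    uint64_t cursor = 0;
    for (size_t i = 0; i < 8; i++) {
        cursor <<= 8;
        if (i < len) cursor |= key[i];
    }
    return cursor;
}
```

For example, "user123456" and "user123457" share the 8-byte prefix "user1234" and therefore map to the same cursor, which is exactly the degenerate case discussed below.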
I'm not sure that what you describe about common prefixes for the rax is acceptable.
Well, if there are a lot of keys like "user123456", they would all need to be returned together, so it becomes like KEYS, which is not acceptable. (Only perhaps in combination with deprecating SCAN.)
Well, 0 (as input) means start from the first key. The cursor always has to increment, so we'll never return 0 again until all keys are scanned; that's not an issue. @Axlgrep sorry for hijacking this PR! I suppose this side-track is done.
Hi @Axlgrep, I have a great interest in the memory allocation of dict rehash and have also done some research. I'd like to confirm: were you using jemalloc when you encountered the issue mentioned?
[Edit] @oranagra, OK, I've moved this comment to the "perfect implementation of hashtable" issue where this thread has been running lately. (Hard to decide where it fits best. That issue is mainly about the nested hash table idea; this PR is about the slow malloc and free, so I thought here might be a good place for the comment. The other place is where the "thread" is now, though.)
@zuiderkwast did you mean to comment in the other issue? If you did, maybe move your comment and I'll respond there.
Recently we discovered a redis-cluster (128 shards) that often reported slow logs for simple commands such as UNLINK/SETEX. The keys/values are not big, and we were confused that such simple commands cost tens or even hundreds of milliseconds.
After we looked at the monitoring, we found some interesting things:
The trend in the number of keys is shown in the following figure:
We realized that it might have something to do with creating the hash table at the beginning of the rehash: our redis-cluster nodes may need to allocate and free large chunks of memory, up to 512MB for the new hash table (40 million keys require a table of size 2^26 to store them, and 2^26 * sizeof(dictEntry*) = 512MB), so we added some logging in the code to record the time consumed by allocating/freeing memory:
ZCalloc log:

ZFree log:

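As a sanity check on the 512MB figure, dict's power-of-two sizing can be sketched like this (the real logic lives in `_dictNextPower()` in dict.c; the initial size of 4 and 8-byte pointers are assumptions here):

```c
/* Sketch of dict's sizing rule: the table grows to the smallest power of
 * two >= the number of keys. 40 million keys round up to 2^26 = 67,108,864
 * buckets; at 8 bytes per dictEntry* that is 536,870,912 bytes = 512MB,
 * matching the allocation sizes seen in the logs above. */
unsigned long next_power(unsigned long size) {
    unsigned long i = 4;    /* assumed DICT_HT_INITIAL_SIZE */
    while (i < size) i *= 2;
    return i;
}
```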
Then we watched for a few days, and sure enough, the slowlog entries were caused by allocating/freeing memory:

Setex slowlog:
Unlink slowlog:

The detailed time consumption over these days is shown in the following figures:
ZCalloc time cost:

ZFree time cost:

So I think, for this type of instance (where the number of keys follows the same trend every day), could we provide the ability to prevent tryResizeHashTables (which triggers dictResize/dictExpand again and again every day, costing CPU time and possibly blocking for too long on allocating/freeing large blocks of memory)? Maybe users don't care about a little extra memory for the hash tables, but performance is what they really care about.
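The proposed config could gate the cron-time resize roughly like this (a hypothetical sketch; the variable and stub names are illustrative, and the real change would wire a `resize-hashtables-enabled` option into the tryResizeHashTables() call made from serverCron()):

```c
#include <stdbool.h>

/* Illustrative stand-ins for the real server config and function. */
static bool resize_hashtables_enabled = true;  /* proposed config, default on */
static int resize_calls = 0;

static void tryResizeHashTables_stub(void) { resize_calls++; }

/* Sketch of the serverCron-side check: skip the periodic resize entirely
 * when the user has disabled it, trading a little extra hash table memory
 * for predictable latency. */
void cron_try_resize(void) {
    if (resize_hashtables_enabled)
        tryResizeHashTables_stub();
}
```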