Fix uniqCombined() result for >UINT_MAX values (return UInt64 to avoid overflow)#7213
Conversation
uniqCombined() return type is UInt64, but uniqCombined() uses CombinedCardinalityEstimator, and CombinedCardinalityEstimator::size() has return type UInt32, while the underlying HyperLogLog::size() is UInt64. So after this patch uniqCombined() can be used for >UINT_MAX values; the outcome is not ideal (ClickHouse#2073) but at least sane.
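The failure mode described above is plain modular truncation: a UInt64 count squeezed through a UInt32 return type is reduced modulo 2^32. A minimal Python sketch (the cardinality value here is illustrative, not taken from ClickHouse):

```python
UINT32_MAX = 2**32 - 1

true_size = 5_000_000_000            # a cardinality above UINT_MAX (illustrative)
truncated = true_size & 0xFFFFFFFF   # what narrowing UInt64 -> UInt32 produces

print(true_size, truncated)          # 5000000000 705032704
```

The truncated value (705,032,704) is not merely imprecise; it is wildly wrong, which is why widening the return path to UInt64 is the sane fix even before the estimator itself is made accurate at that scale.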
Ok. The test:

@azat what is the result of this test on your machine?
Indeed, but if there were a function like
I tested using:
P.S. I used P.P.S. with
Actually, looking into this more and more, I cannot verify that this will change the in-memory/on-disk representation; am I missing something?
Can you point me to the specific line in the code?
Hm, changing for other types

Tested with:
But this will break the uniqueness property of already added values. Maybe a separate function? Thoughts?

I've prepared another patch that will force
uniqCombined has three states:

1. a flat array (for small cardinality);
2. a hash table (for medium cardinality);
3. HyperLogLog, HLL (for large cardinality).
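The three-state design can be sketched in miniature. Everything below is a toy illustration: the thresholds, the SHA-256-derived 32-bit hash, and the register layout are assumptions for demonstration, not ClickHouse's actual parameters or code.

```python
import hashlib

class TieredUniq:
    """Toy sketch: flat array -> hash set -> HyperLogLog (hypothetical sizes)."""
    ARRAY_MAX = 16      # promote flat array to hash set above this
    SET_MAX = 1024      # promote hash set to HLL above this
    HLL_BITS = 12       # 2^12 HLL registers

    def __init__(self):
        self.state = "array"
        self.hashes = []            # the first two states store hash values directly

    @staticmethod
    def _hash32(value) -> int:
        # illustrative 32-bit hash (uniqCombined uses a 32-bit hash for non-String types)
        return int.from_bytes(hashlib.sha256(str(value).encode()).digest()[:4], "big")

    def add(self, value):
        h = self._hash32(value)
        if self.state == "hll":
            self._hll_add(h)
            return
        if self.state == "array":
            if h not in self.hashes:
                self.hashes.append(h)
            if len(self.hashes) > self.ARRAY_MAX:
                self.hashes = set(self.hashes)
                self.state = "set"
        else:  # "set"
            self.hashes.add(h)
            if len(self.hashes) > self.SET_MAX:
                # promote: replay the stored hash values into HLL registers
                self.registers = [0] * (1 << self.HLL_BITS)
                for g in self.hashes:
                    self._hll_add(g)
                self.hashes = None
                self.state = "hll"

    def _hll_add(self, h):
        idx = h >> (32 - self.HLL_BITS)               # bucket = top 12 bits
        rest = h & ((1 << (32 - self.HLL_BITS)) - 1)  # remaining 20 bits
        # rank = position of the highest set bit, counted from the left;
        # with a 32-bit hash it never exceeds 21, so a 5-bit cell suffices
        rank = (32 - self.HLL_BITS) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def size(self) -> int:
        if self.state != "hll":
            return len(self.hashes)   # exact in the first two states
        m = 1 << self.HLL_BITS
        alpha = 0.7213 / (1 + 1.079 / m)          # standard HLL bias constant
        return int(alpha * m * m / sum(2.0 ** -r for r in self.registers))
```

Note how the promotion step only works because states 1 and 2 keep the hash values themselves: they can be replayed into the HLL registers. Once in the HLL state only max-rank summaries remain, which is exactly why the comment below warns that the hash function cannot be swapped without breaking serialized states.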
The sizes of these data structures are aligned to each other. The hash table (2) can grow only up to the size of the HLL (3) and is converted to HLL when it becomes too big. The flat array and the hash table store hash values directly, but HLL does not store hash values: it only stores the max number of leading/trailing zero bits across some subsets of hash values.

For the first two states, using a small hash function is perfectly fine and can give you better performance. The size of HLL depends on the size of the hash function: for a 32-bit hash function, HLL uses 5-bit cells; for a 64-bit hash function, HLL uses 6-bit cells. (BTW, this is the origin of the Log-Log name.) But you can use a 32-bit hash function for an HLL with 6-bit cells; in this case, one bit is always unused.

If a different hash function is used for HLLs of the same size, you can deserialize such an HLL, but the results will be incorrect: the same value will be counted twice in most cases. So don't change the hash function without changing the function name (or some parameter).

If you use a 32-bit hash function, you can still count more than 2^32 elements. You only have to apply a correction from the birthday paradox formula (if I remember correctly, we already have it in uniq and uniqCombined). But it will work only until about 25B values: #6078

You can radically change the uniqCombined function to use a 64-bit hash everywhere, but I suggest making it under another name, like uniqCombined64.
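The birthday-paradox correction mentioned above can be sketched as follows: if n distinct values are hashed into a space of m slots, the expected number of distinct hash values is u = m(1 - (1 - 1/m)^n) ≈ m(1 - e^(-n/m)); inverting gives n ≈ -m·ln(1 - u/m). A small simulation, using a 16-bit space so collisions are plentiful (this illustrates the idea, not ClickHouse's exact formula):

```python
import math
import random

def corrected_cardinality(u: int, m: int) -> float:
    """Invert E[u] = m * (1 - exp(-n/m)) to estimate n from u distinct hashes."""
    return -m * math.log(1.0 - u / m)

m = 1 << 16                     # tiny hash space to make collisions visible
n_true = 50_000
rng = random.Random(42)
hashes = {rng.randrange(m) for _ in range(n_true)}   # randrange simulates an ideal hash
u = len(hashes)

print(u)                                   # noticeably less than n_true
print(round(corrected_cardinality(u, m)))  # close to 50_000 again
```

The inversion blows up as u approaches m (the logarithm's argument goes to zero), which matches the comment's observation that a 32-bit hash only works up to roughly 25B values before the correction becomes useless.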
Interesting: since there is a 32-bit hash for non-String types, the hash table (for medium cardinality) can be expanded by a power of 2 for non-String types, and this will make the value exact for up to 2^14 elements.

This will break compatibility, or am I missing something?
That patch introduces a second parameter for it, but after thinking about this a little bit more, and

P.S. after this change.
By default uniqCombined() uses a 32-bit hash for all types except String, which means you cannot use uniqCombined() with, e.g., UInt64 and cardinality > UINT_MAX, although you can use uniqCombined(toString()); that leads to a 64-bit hash but does not have good performance. So uniqCombined64() has been introduced to fix this. Requires: ClickHouse#7213
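The hard ceiling behind this commit is easy to demonstrate in miniature: the number of distinct hash values can never exceed the hash space, so any estimator fed by a 32-bit hash saturates near 2^32 regardless of the true cardinality. A sketch with a 16-bit stand-in space (randrange simulates an ideal hash; the sizes are illustrative):

```python
import random

m = 1 << 16                    # stand-in for the 2^32 space of a 32-bit hash
rng = random.Random(0)

n_true = 500_000               # far more distinct inputs than hash slots
distinct = len({rng.randrange(m) for _ in range(n_true)})

print(distinct, m)             # distinct is hard-capped at m = 65536
```

No downstream arithmetic can recover n_true from a value capped at m once the space is nearly full, which is why a wider (64-bit) hash, rather than a wider return type alone, is needed for cardinalities beyond UINT_MAX.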
Category:
Short description (up to a few sentences):
Fix uniqCombined() result for >UINT_MAX values (return UInt64 to avoid overflow)
Detailed description (optional):
uniqCombined() return type is UInt64, but uniqCombined() uses
CombinedCardinalityEstimator, and CombinedCardinalityEstimator::size()
return type is UInt32, while the underlying HyperLogLog::size() is
UInt64.
So after this patch uniqCombined() can be used for >UINT_MAX values; the
outcome is not ideal (#2073) but at least sane.