Skip to content

Fix uniqCombined() result for >UINT_MAX values (return UInt64 to avoid overflow)#7213

Merged
alexey-milovidov merged 1 commit intoClickHouse:masterfrom
azat:fix-uniqCombined-approximation
Oct 8, 2019
Merged

Fix uniqCombined() result for >UINT_MAX values (return UInt64 to avoid overflow)#7213
alexey-milovidov merged 1 commit intoClickHouse:masterfrom
azat:fix-uniqCombined-approximation

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Oct 7, 2019

Category:

  • Bug Fix

Short description (up to few sentences):

Fix uniqCombined() result for >UINT_MAX values (return UInt64 to avoid overflow)

Detailed description (optional):

uniqCombined() return type is UInt64, but uniqCombined() uses
CombinedCardinalityEstimator, and CombinedCardinalityEstimator::size()
return type is UInt32, while the underlying HyperLogLog::size() is
UInt64.

So after this patch uniqCombined() can be used for >UINT_MAX values, the
outcome is not ideal (#2073) but at least sane.

…d overflow)

uniqCombined() return type is UInt64, but uniqCombined() uses
CombinedCardinalityEstimator, and CombinedCardinalityEstimator::size()
return type is UInt32, while the underlying HyperLogLog::size() is
UInt64.

So after this patch uniqCombined() can be used for >UINT_MAX values, the
outcome is not ideal (ClickHouse#2073) but at least sane.
@alexey-milovidov
Copy link
Copy Markdown
Member

Ok. The test: SELECT uniqCombined(number) FROM numbers_mt(10000000000)
will not be included in CI, because it's slow.

@alexey-milovidov
Copy link
Copy Markdown
Member

@azat what is the result of this test on your machine?

@azat
Copy link
Copy Markdown
Member Author

azat commented Oct 7, 2019

Ok. The test: SELECT uniqCombined(number) FROM numbers_mt(10000000000)
will not be included in CI, because it's slow.

Indeed, but if there were function like writeAggregation/readAggregation (the same as finalizeAggregation) this can be possible

what is the result of this test on your machine?

I tested using:

select uniq from (select uniqCombined(toString(number)) uniq from (select number from system.numbers limit 1e10))
  • before: 1 427 486 203
  • after for uint64 as a hash in HLL(String ch type): 10 017 420 795
  • after for uint32 as hash in HLL (UInt64 ch type): ~5 545 308 725

P.S. I used toString here to force uint64 type (this looks odd to me that uint64 only for string but this is another story)

P.P.S. with numbers_mt it tooks 96.749 sec on my laptop.

@azat
Copy link
Copy Markdown
Member Author

azat commented Oct 7, 2019

this looks odd to me that uint64 only for string but this is another story

Actually looking into this more and more, and I cannot verify that this will change in-memory/on-disk representation, am I missing something?

@alexey-milovidov
Copy link
Copy Markdown
Member

I used toString here to force uint64 type (this looks odd to me that uint64 only for string but this is another story)

Can you point me to specific line in code?

@azat
Copy link
Copy Markdown
Member Author

azat commented Oct 7, 2019

Can you point me to specific line in code?

@azat
Copy link
Copy Markdown
Member Author

azat commented Oct 7, 2019

Actually looking into this more and more, and I cannot verify that this will change in-memory/on-disk representation, am I missing something?

Hm, changing for other types HashValueType to uint64 does not break reading HLL from the disk, will you mind if I will change this?

Tested with:

create table stats (items UInt64, uniq AggregateFunction(uniqCombined, String)) engine=TinyLog()
insert into stats select 1e10 as items, uniq from (select uniqCombinedState(toString(number)) uniq from (select number from system.numbers limit 1e10))
select items, uniqCombinedMerge(uniq) uniq from stats group by items
┌───────items─┬────────uniq─┬
│ 10000000000 │ 10017420795 │
└─────────────┴─────────────┴
# restart ch and change the typ
# same output

@azat
Copy link
Copy Markdown
Member Author

azat commented Oct 7, 2019

does not break reading HLL from the disk

But this will break uniqueness property of already added values. Maybe a separate function? Thoughts?

@azat
Copy link
Copy Markdown
Member Author

azat commented Oct 7, 2019

I've prepared another patch that will force uniqCombined to use 64-bit hash, once this will be resolved

@alexey-milovidov
Copy link
Copy Markdown
Member

alexey-milovidov commented Oct 7, 2019

uniqCombined has three states:

  1. Small flat array.
  2. Hash table.
  3. HLL.

The sizes of these data structures are aligned to each other. Hash table (2) can grow only up to the size of HLL (3) and is converted to HLL when becomes too big.

Flat array and hash table stores hash values directly. But HLL does not store hash values - it does only store max number of leading/trailing zero bits across some subsets of hash values.

For first two states, using small hash function is perfectly fine and can give you better performance.
For HLL, it's better to use 64bit hash function. But as you need the value of hash function to convert from hash table to HLL, you cannot use different hashes.

The size of HLL is dependent on the size of hash function. For 32-bit hash function, HLL is using 5 bit cells; for 64-bit hash function, HLL is using 6-bit cells. (BTW, this is the origin of Log-Log name).

But you can use 32-bit hash function for HLL with 6-bit cells. In this case, one bit is always unused.

If different hash function was used for HLLs of the same size, you can deserialize this HLL but results would be incorrect. The same value will be counted twice in most cases. So, don't change the hash function without changing the function name (or some parameter).

If you use 32-bit hash function, you still can count more than 2^32 elements. You only have to apply correction for birthday paradox formula (If I remember correctly, we already have it in uniq and unqCombined). But it will work until about 25B values: #6078

You can radically change uniqCombined function to use 64-bit hash everywhere, but I suggest to make it under another name like uniqEnhanced.

@azat
Copy link
Copy Markdown
Member Author

azat commented Oct 8, 2019

Hash table.

Interesting, since there is 32-bit hash for non-string types, hash table (for medium cardinality) can be expanded by the power of 2 for non-string types, and this will make the value exact for up to 2^14 elements.

You can radically change uniqCombined function to use 64-bit hash everywhere

This will break compatibility, or am I missing something?

but I suggest to make it under another name like uniqEnhanced.

That patch introduces second parameter for it, but after thinking about this a little bit more, and uniqCombined64 came to my mind (I did not want to reserve uniqEnhanced to leave it for #2073)

@alexey-milovidov
Copy link
Copy Markdown
Member

PS.

:) SELECT uniqCombined(reinterpretAsString(number)) FROM numbers_mt(10000000000)

SELECT uniqCombined(reinterpretAsString(number))
FROM numbers_mt(10000000000)

┌─uniqCombined(reinterpretAsString(number))─┐
│                                9990159598 │                                                                                                                                                                                                                                                                                                                                                                                                               
└───────────────────────────────────────────┘

after this change.

@alexey-milovidov alexey-milovidov merged commit d8b3db6 into ClickHouse:master Oct 8, 2019
azat added a commit to azat/ClickHouse that referenced this pull request Oct 8, 2019
…_MAX

By default uniqCombined() uses 32-bit hash for all types except String,
so this means that you cannot use uniqCombined() with i.e UInt64 and
cardinality > UINT_MAX, although you can use uniqCombined(toString())
and this will lead to 64-bit hash, but will not have good performance.

So uniqCombined64() had been introduced to fix this.

Requires: ClickHouse#7213
@azat azat deleted the fix-uniqCombined-approximation branch October 8, 2019 20:17
@akuzm akuzm added the pr-improvement Pull request with some product improvements label Oct 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants