
Increase hash table max size for the uniqCombined() function for non-String types #7221

Closed
azat wants to merge 1 commit into ClickHouse:master from azat:uniqCombined-adjust-HashTable-size

Conversation


@azat azat commented Oct 8, 2019

Category (leave one):

  • Improvement

Short description (up to few sentences):
Increase hash table max size to 2^14 for the uniqCombined() function for non-String types

Detailed description (optional):

This will give an accurate result for uniqCombined() and non-String types
for cardinalities up to 1<<14 (before this patch the accurate result was
for cardinalities up to 1<<13).


alexey-milovidov commented Oct 8, 2019

It doesn't look correct:

For K = 17, the size of the HLL in bytes:

>>> 2 ** 17 * 6 / 8
98304

A hash table capped at 2^14 elements will contain 2^15 cells of 32-bit size:

>>> 4 * 2 * 2 ** 14
131072

And this is greater than the size of HLL.
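The arithmetic above can be restated as one self-contained check (plain Python, matching the `>>>` snippets in this thread):

```python
K = 17                            # uniqCombined's HyperLogLog precision
hll_bytes = 2 ** K * 6 // 8       # 2^17 six-bit cells
print(hll_bytes)                  # 98304

# A table capped at 2^14 elements keeps a buffer of 2^15 cells so it
# stays at most half full; each cell holds a 32-bit hash value.
table_bytes = 4 * 2 * 2 ** 14
print(table_bytes)                # 131072

print(table_bytes > hll_bytes)    # True: the table outgrows the HLL
```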


alexey-milovidov commented Oct 8, 2019

How to validate:

SET max_memory_usage = 100000000;

SELECT intDiv(number, 16384) AS k, uniqCombined(number % 16384) FROM numbers(16384000) GROUP BY k;

Each state uses no more than 96 KiB.
Please add a test and check whether it fails after these modifications.


azat commented Oct 8, 2019

A hash table capped at 2^14 elements will contain 2^15 cells of 32-bit size:

>> 4 * 2 * 2 ** 14
131072

Hm, I did not think about the fact that HashTable<unsigned int>(1<<14) != HashTable<unsigned long>(1<<13)

Also, I checked this in the debugger and got:

  • 1<<13 with String: getBufferSizeInBytes() == 131072
  • 1<<14 with UInt32: getBufferSizeInBytes() == 262144

That means that even for String the hash table can be bigger than the HLL

(if I did not miss anything)
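A rough check of the getBufferSizeInBytes() numbers above (the cell sizes and the two-powers-of-two growth step are assumptions inferred from this thread, not the exact ClickHouse internals):

```python
HLL_BYTES = 2 ** 17 * 6 // 8          # 98304

# 1<<13 elements with String: 8-byte cells, buffer of 2^14 cells
string_case = (1 << 14) * 8           # 131072
# 1<<14 elements with UInt32: 4-byte cells, but a grower that jumps two
# powers of two at a time overshoots to a buffer of 2^16 cells
uint32_case = (1 << 16) * 4           # 262144

print(string_case, uint32_case)
print(string_case > HLL_BYTES, uint32_case > HLL_BYTES)  # True True
```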


azat commented Oct 8, 2019

SELECT intDiv(number, 16384) AS k, uniqCombined(number % 16384) FROM numbers(16384000) GROUP BY k;

And indeed, this one fails after the patch


azat commented Oct 8, 2019

Hm, but don't we need to care only about the on-disk representation? (on disk it should not be greater than 65546 bytes -- (1<<14)*4+8+2)


alexey-milovidov commented Oct 8, 2019

Also I checked this in debugger, and got this:
1<<13 with String: getBufferSizeInBytes() == 131072
1<<14 with UInt32: getBufferSizeInBytes() == 262144
That means that even for String the hash table can be bigger than the HLL

The hash table is resized (or converted to HLL) when it reaches the next power of two (if I remember correctly).
It should be of size 64K for 2^13 - 1 elements.
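The 64K figure, and the cost of the very next element, can be checked with quick arithmetic (assuming 4-byte cells and a grower that jumps two powers of two at a time):

```python
# 2^13 - 1 elements fit in a buffer of 2^14 cells (at most half full):
before = (1 << 14) * 4        # 65536 bytes == 64 KiB
# One more element triggers a resize; a grower that jumps two powers of
# two at once takes the buffer straight to 2^16 cells:
after = (1 << 16) * 4         # 262144 bytes

print(before, after)          # 65536 262144
```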

@alexey-milovidov

Hm, but don't we need to care only about the on-disk representation? (on disk it should not be greater than 65546 bytes -- (1<<14)*4+8+2)

We care about memory size first. The size on disk (and the size of the serialized state transferred over the network, and how compressible it will be) also matters. But the sizes in memory must be aligned with each other.


azat commented Oct 8, 2019

resized when it reaches the next power of two

Indeed

It should be of size 64K for 2^13 - 1 elements.

So the in-memory representation also matters, and the maximum size of the hash table should be reduced by one power of two, instead of increased?


alexey-milovidov commented Oct 8, 2019

If it is resized and then, after adding just one more element, converted to HLL, it is a waste of memory and CPU and should be fixed.


azat commented Oct 8, 2019

Okay, thanks, closing

@azat azat closed this Oct 8, 2019
@alexey-milovidov

Hm, but don't we need to care only about the on-disk representation? (on disk it should not be greater than 65546 bytes -- (1<<14)*4+8+2)

The size is motivated by memory requirements.
The on-disk representation is a different story (it's a less important scenario -- AggregatingMergeTree and external aggregation). BTW, data on disk is always compressed (but compression of UInt32 hash tables, as well as HLLs, is almost useless).

@alexey-milovidov

Do you want to add my test just in case? :)


azat commented Oct 8, 2019

Do you want to add my test just in case? :)

Sure, along with a fix for medium_set_size_max


azat commented Oct 8, 2019

Do you want to add my test just in case? :)

Because HashTableGrower::increaseSize makes the HashTable grow by two powers of two at a time (until the buffer reaches 2^23), using 2^13 - 1 instead of 2^13 does not change anything, since the buffer grows as follows:

  • 128 # maxFill() of the initial buffer allows only 128 items, due to the high load factor
  • 129 elements in the hashtable -> 1024
  • 513 elements in the hashtable -> 4096
  • 2049 elements in the hashtable -> 16384

So this can be fixed by:

  • replacing the hashtable with e.g. sparse_hash_set
  • replacing the grower with a fixed-size one (at the cost of using maximum memory even for a small number of elements)
  • introducing another grower that grows by one power of 2 at a time
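The growth pattern in the list above can be simulated with a short sketch (the initial degree, the half-full resize rule, and the function name are illustrative assumptions, not the ClickHouse API):

```python
def growth_sequence(n_elements, initial_degree=8, step=2):
    """Buffer sizes a table passes through while inserting n_elements,
    resizing (by `step` powers of two) once it is more than half full."""
    degree = initial_degree
    sizes = [2 ** degree]
    for filled in range(1, n_elements + 1):
        if filled > 2 ** (degree - 1):   # maxFill == half the buffer
            degree += step
            sizes.append(2 ** degree)
    return sizes

# With the double-power-of-two grower, capping at 2^13 - 1 instead of
# 2^13 ends with exactly the same 16384-cell buffer:
print(growth_sequence(2 ** 13))        # [256, 1024, 4096, 16384]
print(growth_sequence(2 ** 13 - 1))    # [256, 1024, 4096, 16384]

# A grower that adds one power of two at a time takes smaller steps:
print(growth_sequence(2 ** 13, step=1))
# [256, 512, 1024, 2048, 4096, 8192, 16384]
```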


azat commented Oct 8, 2019

introduce another grower that will grow by one power of 2

@alexey-milovidov something like #7236


alexey-milovidov commented Oct 8, 2019

replacing the hashtable with e.g. sparse_hash_set

Not an option because it will become extremely slow.

replacing the grower with a fixed-size one will use maximum memory even for a small number of elements

Not an option because it would defeat the whole point of the combined data structure.
(It is needed for the typical case of GROUP BY over a huge number of non-uniformly distributed keys like URLs, search phrases, regions... when most aggregation states are small.)

introduce another grower that will grow by one power of 2

That's the right solution.

azat added a commit to azat/ClickHouse that referenced this pull request Oct 8, 2019
uniqCombined() uses a hashtable for medium cardinality, and since the
HashTable resizes by powers of 2 (well, actually HashTableGrower grows
by two powers of 2 at a time, hence HashTableGrower::increaseSize()
should be overridden to change this), with 1<<13 (the default for
uniqCombined) and UInt64 HashValueType, the HashTable takes:

  getBufferSizeInBytes() == 131072

while it should be no greater than sizeof(HLL) ~= 98K. So reduce the
maximum cardinality for the hashtable to 1<<12 with UInt64 HashValueType
and to 1<<13 with UInt32, override HashTableGrower::increaseSize(), and
cover this with a max_memory_usage test.

Refs: ClickHouse#7221 (comment)

v2: cover uniqCombined() with non-default K
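The limits chosen in this commit can be sanity-checked the same way (assuming 8-byte cells for the UInt64 hash value type, 4-byte cells for UInt32, and a single-power-of-two grower so the buffer stays at 2x the element cap):

```python
HLL_BYTES = 2 ** 17 * 6 // 8            # 98304, sizeof(HLL) for K = 17

# 1<<12 elements with UInt64 hash values: buffer of 2^13 8-byte cells
uint64_case = (1 << 13) * 8             # 65536
# 1<<13 elements with UInt32 hash values: buffer of 2^14 4-byte cells
uint32_case = (1 << 14) * 4             # 65536

print(uint64_case <= HLL_BYTES, uint32_case <= HLL_BYTES)  # True True
```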
@azat azat deleted the uniqCombined-adjust-HashTable-size branch October 9, 2019 21:43