
Increase hash table max size for the uniqCombined() function for non-String types #7221

Closed
azat wants to merge 1 commit into ClickHouse:master from azat:uniqCombined-adjust-HashTable-size

Conversation


@azat azat commented Oct 8, 2019

Category (leave one):

  • Improvement

Short description (up to few sentences):
Increase hash table max size to 2^14 for the uniqCombined() function for non-String types

Detailed description (optional):

This will give an accurate result for uniqCombined() and non-String types
for cardinalities up to 1<<14 (before this patch the accurate result was
for cardinalities up to 1<<13).


alexey-milovidov commented Oct 8, 2019

It doesn't look correct:

For K = 17, the size of the HLL in bytes:

>>> 2 ** 17 * 6 / 8
98304

A hash table capped at 2^14 elements will contain 2^15 cells of 32-bit size:

>>> 4 * 2 * 2 ** 14
131072

And this is greater than the size of HLL.
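The arithmetic above can be restated as one self-contained check (plain Python, matching the `>>>` snippets in this thread):

```python
K = 17                            # uniqCombined's HyperLogLog precision
hll_bytes = 2 ** K * 6 // 8       # 2^17 six-bit cells
print(hll_bytes)                  # 98304

# A table capped at 2^14 elements keeps a buffer of 2^15 cells so it
# stays at most half full; each cell holds a 32-bit hash value.
table_bytes = 4 * 2 * 2 ** 14
print(table_bytes)                # 131072

print(table_bytes > hll_bytes)    # True: the table outgrows the HLL
```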


alexey-milovidov commented Oct 8, 2019

How to validate:

SET max_memory_usage = 100000000;

SELECT intDiv(number, 16384) AS k, uniqCombined(number % 16384) FROM numbers(16384000) GROUP BY k;

Each state uses no more than 96 KiB.
Please add a test and check whether it fails after these modifications.


azat commented Oct 8, 2019

A hash table capped at 2^14 elements will contain 2^15 cells of 32-bit size:

>> 4 * 2 * 2 ** 14
131072

Hm, I did not think about the fact that HashTable<unsigned int>(1<<14) != HashTable<unsigned long>(1<<13)

Also, I checked this in the debugger and got:

  • 1<<13 with String: getBufferSizeInBytes() == 131072
  • 1<<14 with UInt32: getBufferSizeInBytes() == 262144

That means that even for String the hash table can be bigger than the HLL

(if I did not miss anything)
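A rough check of the getBufferSizeInBytes() numbers above (the cell sizes and the two-powers-of-two growth step are assumptions inferred from this thread, not the exact ClickHouse internals):

```python
HLL_BYTES = 2 ** 17 * 6 // 8          # 98304

# 1<<13 elements with String: 8-byte cells, buffer of 2^14 cells
string_case = (1 << 14) * 8           # 131072
# 1<<14 elements with UInt32: 4-byte cells, but a grower that jumps two
# powers of two at a time overshoots to a buffer of 2^16 cells
uint32_case = (1 << 16) * 4           # 262144

print(string_case, uint32_case)
print(string_case > HLL_BYTES, uint32_case > HLL_BYTES)  # True True
```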


azat commented Oct 8, 2019

SELECT intDiv(number, 16384) AS k, uniqCombined(number % 16384) FROM numbers(16384000) GROUP BY k;

And indeed, this one fails after the patch


azat commented Oct 8, 2019

Hm, but don't we need to care only about the on-disk representation? (on disk it should not be greater than 65546 bytes -- (1<<14)*4+8+2)


alexey-milovidov commented Oct 8, 2019

Also I checked this in debugger, and got this:
1<<13 with String: getBufferSizeInBytes() == 131072
1<<14 with UInt32: getBufferSizeInBytes() == 262144
That means that even for String the hash table can be bigger than the HLL

The hash table is resized (or converted to HLL) when it reaches the next power of two (if I remember correctly).
It should be of size 64K for 2^13 - 1 elements.
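The 64K figure, and the cost of the very next element, can be checked with quick arithmetic (assuming 4-byte cells and a grower that jumps two powers of two at a time):

```python
# 2^13 - 1 elements fit in a buffer of 2^14 cells (at most half full):
before = (1 << 14) * 4        # 65536 bytes == 64 KiB
# One more element triggers a resize; a grower that jumps two powers of
# two at once takes the buffer straight to 2^16 cells:
after = (1 << 16) * 4         # 262144 bytes

print(before, after)          # 65536 262144
```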

@alexey-milovidov

Hm, but don't we need to care only about the on-disk representation? (on disk it should not be greater than 65546 bytes -- (1<<14)*4+8+2)

We care about memory size first. The size on disk (and the size of the serialized state transferred over the network, and how compressible it will be) also matters. But the sizes in memory must be aligned with each other.


azat commented Oct 8, 2019

resized when it reaches the next power of two

Indeed

It should be of size 64K for 2^13 - 1 elements.

So the in-memory representation also matters, and the maximum size of the hash table should be reduced by one power of two, instead of increased?


alexey-milovidov commented Oct 8, 2019

If it is resized and then, after adding just one more element, converted to HLL, it is a waste of memory and CPU and should be fixed.


azat commented Oct 8, 2019

Okay, thanks, closing

@azat azat closed this Oct 8, 2019
@alexey-milovidov

Hm, but don't we need to care only about the on-disk representation? (on disk it should not be greater than 65546 bytes -- (1<<14)*4+8+2)

The size is motivated by memory requirements.
The on-disk representation is a different story (it's a less important scenario -- AggregatingMergeTree and external aggregation). BTW, data on disk is always compressed (but compression of UInt32 hash tables, as well as HLLs, is almost useless).

@alexey-milovidov

Do you want to add my test just in case? :)


azat commented Oct 8, 2019

Do you want to add my test just in case? :)

Sure, along with a fix for medium_set_size_max


azat commented Oct 8, 2019

Do you want to add my test just in case? :)

Because HashTableGrower::increaseSize makes the HashTable grow by two powers of two at a time (until the buffer reaches 2^23), using 2^13 - 1 instead of 2^13 does not change anything, since the buffer grows as follows:

  • 128 # maxFill() of the initial buffer allows only 128 items, due to the high load factor
  • 129 elements in the hashtable -> 1024
  • 513 elements in the hashtable -> 4096
  • 2049 elements in the hashtable -> 16384

So this can be fixed by:

  • replacing the hashtable with e.g. sparse_hash_set
  • replacing the grower with a fixed-size one (at the cost of using maximum memory even for a small number of elements)
  • introducing another grower that grows by one power of 2 at a time
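The growth pattern in the list above can be simulated with a short sketch (the initial degree, the half-full resize rule, and the function name are illustrative assumptions, not the ClickHouse API):

```python
def growth_sequence(n_elements, initial_degree=8, step=2):
    """Buffer sizes a table passes through while inserting n_elements,
    resizing (by `step` powers of two) once it is more than half full."""
    degree = initial_degree
    sizes = [2 ** degree]
    for filled in range(1, n_elements + 1):
        if filled > 2 ** (degree - 1):   # maxFill == half the buffer
            degree += step
            sizes.append(2 ** degree)
    return sizes

# With the double-power-of-two grower, capping at 2^13 - 1 instead of
# 2^13 ends with exactly the same 16384-cell buffer:
print(growth_sequence(2 ** 13))        # [256, 1024, 4096, 16384]
print(growth_sequence(2 ** 13 - 1))    # [256, 1024, 4096, 16384]

# A grower that adds one power of two at a time takes smaller steps:
print(growth_sequence(2 ** 13, step=1))
# [256, 512, 1024, 2048, 4096, 8192, 16384]
```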


azat commented Oct 8, 2019

introduce another grower that will grow by one power of 2

@alexey-milovidov something like #7236


alexey-milovidov commented Oct 8, 2019

replacing the hashtable with e.g. sparse_hash_set

Not an option because it will become extremely slow.

replacing the grower with a fixed-size one will use maximum memory even for a small number of elements

Not an option because it would defeat the whole point of the combined data structure.
(It is needed for the typical case of GROUP BY over a huge number of non-uniformly distributed keys like URLs, search phrases, regions... when most aggregation states are small.)

introduce another grower that will grow by one power of 2

That's the right solution.

azat added a commit to azat/ClickHouse that referenced this pull request Oct 8, 2019
uniqCombined() uses a hashtable for medium cardinality, and since the
HashTable resizes by powers of 2 (well, actually HashTableGrower grows
by two powers of 2 at a time, hence HashTableGrower::increaseSize()
should be overridden to change this), with 1<<13 (the default for
uniqCombined) and UInt64 HashValueType, the HashTable takes:

  getBufferSizeInBytes() == 131072

while it should be no greater than sizeof(HLL) ~= 98K. So reduce the
maximum cardinality for the hashtable to 1<<12 with UInt64 HashValueType
and to 1<<13 with UInt32, override HashTableGrower::increaseSize(), and
cover this with a max_memory_usage test.

Refs: ClickHouse#7221 (comment)

v2: cover uniqCombined() with non-default K
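The limits chosen in this commit can be sanity-checked the same way (assuming 8-byte cells for the UInt64 hash value type, 4-byte cells for UInt32, and a single-power-of-two grower so the buffer stays at 2x the element cap):

```python
HLL_BYTES = 2 ** 17 * 6 // 8            # 98304, sizeof(HLL) for K = 17

# 1<<12 elements with UInt64 hash values: buffer of 2^13 8-byte cells
uint64_case = (1 << 13) * 8             # 65536
# 1<<13 elements with UInt32 hash values: buffer of 2^14 4-byte cells
uint32_case = (1 << 14) * 4             # 65536

print(uint64_case <= HLL_BYTES, uint32_case <= HLL_BYTES)  # True True
```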
@azat azat deleted the uniqCombined-adjust-HashTable-size branch October 9, 2019 21:43