feat: Add stringBytesUniq and stringBytesEntropy functions #79350

alexey-milovidov merged 12 commits into ClickHouse:master
Conversation
At first glance, the code looks good 💪
This is unrelated. |
> **Returned value**
>
> - The number of distinct bytes in the string. [UInt8](../data-types/int-uint.md).
The number of distinct bytes can range from 0 to 256, while the range of the UInt8 data type is just 0 to 255. Let's change to UInt16.
```cpp
inline UInt8 countBits(UInt64 x)
{
#if defined(__clang__)
```
We only use clang, so there is no problem if we keep only this branch.
There is `std::popcount`.
TIL about this in C++20. Using std::popcount instead.
```cpp
private:
    static constexpr size_t COUNTERS_SIZE = 256;
    UInt32 counters[COUNTERS_SIZE] = {0};
    UInt32 current_generation = 0;
```
Btw, was this optimization worth it?
I didn't benchmark it specifically, but it definitely seems to be worth it because instead of resetting all 256 counters (O(n)) between strings, it uses the bespoke generation counter, which is an O(1) operation. For columns with many sequential strings, this approach will be significantly faster due to lower cost of initialisation for the individual strings.
```cpp
UInt32 & counter = counters[byte];
if ((counter & generation_mask) != current_generation)
    counter = current_generation;
counter++;
```
A minor thing: we write prefix increment by default.
```cpp
static ResultType process(const char * str, size_t size)
{
    uint64_t mask[4] = {0};
```
It's unusual that `UInt64` is used elsewhere, but `uint64_t` only here.
Copy-pasta issue. Fixed.
src/Functions/FunctionsStringBytes.h
```cpp
for (size_t i = 0; i < input_rows_count; ++i)
{
    const char * str = reinterpret_cast<const char *>(data.data() + (i == 0 ? 0 : offsets[i - 1]));
```
I see we do reinterpret_cast twice: here to char * and there back to UInt8 *. We can remove both.
Fair. I changed the processing function to take const UInt8 * to simplify things.
src/Functions/FunctionsStringBytes.h
```cpp
for (size_t i = 0; i < input_rows_count; ++i)
{
    const char * str = reinterpret_cast<const char *>(data.data() + (i == 0 ? 0 : offsets[i - 1]));
```
There is one unusual thing about offsets (PaddedPODArray): they allow referencing the -1th element, so it is safe to always write offsets[i - 1].
I'd love to read how that works. For now, I fixed it.
src/Functions/FunctionsStringBytes.h
```cpp
for (size_t i = 0; i < input_rows_count; ++i)
{
    const char * str = reinterpret_cast<const char *>(data.data() + (i == 0 ? 0 : offsets[i - 1]));
    const size_t size = offsets[i] - (i == 0 ? 0 : offsets[i - 1]) - 1;
```
However, the idiomatic way is to keep a prev_offset variable and update it in the loop.
```sql
SELECT stringBytesUniq('Hello');
```
At least one test with all 256 bytes.
```sql
SELECT stringBytesEntropy('abcABC123');
SELECT stringBytesEntropy(toNullable('Hello'));
SELECT stringBytesEntropy(toLowCardinality('Hello'));
```
At least one test where the string is non-constant (there is more than one string in a block).
alexey-milovidov left a comment
Add a performance test as well (tests/performance).
```cpp
#include <cmath>
```
We prefer every function in a separate file, as it opens possibilities to, e.g., link to a file from the docs, provide more precise authorship info, etc. So let's split into two files.
And then the .h file will be unneeded, so we can keep everything inside two .cpp files.
These .cpp files should be named exactly as the function, e.g., stringBytesUniq.cpp
I understand it would be good to separate the core logic of the functions into individual .cpp files, but wouldn't we still need a common .h file to reuse the class FunctionStringBytes for processing? Wouldn't re-implementing it in the individual .cpp files lead to code repetition?
Sorry if I misunderstood something.
Yes, you are right. Let's keep FunctionStringBytes in the header file. The .cpp files will contain the corresponding Impl, instantiate the template, and register the function.
Made this change as well.
The integration test failure seems to be unrelated to the changes; related to
```text
\N
\N
\N
0
```
This result (0) looks incorrect.
There was a problem hiding this comment.
Ah yes, corner case. I now use UInt16.
src/Functions/stringBytesUniq.cpp
```cpp
struct StringBytesUniqImpl
{
    using ResultType = UInt8;
```
Because the type here wasn't changed.
I now use UInt16.
alexey-milovidov left a comment
Looks good!
One ask: could you please compare the performance (on your machine is fine) with and without the optimization with current_generation?
You can do it on the https://github.com/ClickHouse/ClickBench dataset, on the URL, SearchPhrase, Title, MobilePhoneModel columns.
Hi @alexey-milovidov, I did some benchmarking.

```shell
#!/bin/bash -e

./programs/clickhouse-client < create.sql
./programs/clickhouse-client --query "INSERT INTO hits SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/hits.tsv.gz')" --time

TRIES=3
QUERY_NUM=1
cat queries.sql | while read -r query; do
    echo -n "["
    for i in $(seq 1 $TRIES); do
        (./programs/clickhouse-client --time --format=Null --query="$query" --progress 0 ${EXTRA_SETTINGS} 2>&1 |
            grep -Eo '^[0-9]+\.[0-9]+$' || echo -n "null") | tr -d '\n'
        [[ "$i" != $TRIES ]] && echo -n ", "
    done
    echo "],"
    QUERY_NUM=$((QUERY_NUM + 1))
done

./programs/clickhouse-client --query "SELECT total_bytes FROM system.tables WHERE name = 'hits' AND database = 'default'"
```

I then switch between the optimised and unoptimised versions by rebuilding the server. Here's the implementation I used for the unoptimised version:

```cpp
static ResultType process(const UInt8 * data, size_t size)
{
    if (size == 0)
        return 0;

    std::array<UInt32, 256> counters{};
    const UInt8 * end = data + size;

    // Count frequency of each byte
    for (; data < end; ++data)
        counters[*data]++;

    Float64 entropy = 0.0;
    Float64 total = size;

    for (size_t byte = 0; byte < 256; ++byte)
    {
        UInt32 count = counters[byte];
        if (count > 0)
        {
            double p = static_cast<double>(count) / total;
            entropy -= p * std::log2(p);
        }
    }

    return entropy;
}
```

Also, instead of inserting the entire table, I just used 40 million rows. Having done all that, here are the results of the optimised and unoptimised versions over 3 runs:
```text
# unoptimised version
~/.../build/benchmark >>> ./benchmark.sh
[2.136, 2.115, 2.284],
[0.240, 0.240, 0.225],
[2.246, 2.026, 2.088],
[0.076, 0.074, 0.078],
5018654101
~/.../build/benchmark >>> ./benchmark.sh
[2.276, 1.969, 2.866],
[0.308, 0.428, 0.275],
[3.662, 3.054, 2.660],
[0.103, 0.122, 0.082],
5018654101
~/.../build/benchmark >>> ./benchmark.sh
[2.228, 2.071, 2.104],
[0.256, 0.236, 0.234],
[2.329, 2.192, 2.645],
[0.088, 0.082, 0.082],
5018654101

# optimised version
~/.../build/benchmark >>> ./benchmark.sh
[2.592, 2.198, 2.237],
[0.287, 0.237, 0.256],
[2.443, 2.304, 2.327],
[0.089, 0.080, 0.085],
5018654101
~/.../build/benchmark >>> ./benchmark.sh
[2.333, 2.151, 2.251],
[0.277, 0.298, 0.274],
[2.410, 2.303, 2.243],
[0.084, 0.081, 0.082],
5018654101
~/.../build/benchmark >>> ./benchmark.sh
[2.270, 2.391, 2.267],
[0.284, 0.243, 0.246],
[2.422, 2.298, 2.300],
[0.082, 0.084, 0.083],
5018654101
```

Seems like the unoptimised version performed better!? Not sure if I did something wrong in the entire setup.
Yes, it was suspected, because the amount of memory to clear is quite small, while the optimization introduces branches into the code, making it more difficult for the compiler to vectorize. Let's switch to the unoptimized version, and we can merge!
@alexey-milovidov can we retest?
Not sure if I'm looking at it right, but the failures seem unrelated?

Failures are unrelated. CI for the PR anyway seems broken, so I tried merging master into this branch to try to fix some of these unrelated failures.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add `stringBytesUniq` and `stringBytesEntropy` functions to search for possibly random or encrypted data.

Documentation entry for user-facing changes

Fixes #79305

Changes:
- `stringBytesUniq`: Counts the number of distinct bytes in a string.
- `stringBytesEntropy`: Calculates Shannon's entropy of the byte distribution in a string.