
feat: Add stringBytesUniq and stringBytesEntropy functions #79350

Merged
alexey-milovidov merged 12 commits into ClickHouse:master from sachinkumarsingh092:string-bytes-functions
Apr 29, 2025
Conversation

@sachinkumarsingh092
Contributor

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add stringBytesUniq and stringBytesEntropy functions to search for possibly random or encrypted data.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Fixes #79305

Changes:

  • Added the following two functions along with tests and docs for them:
    • stringBytesUniq: Counts the number of distinct bytes in a string.
    • stringBytesEntropy : Calculates Shannon's entropy of byte distribution in a string.
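The semantics of the two functions can be sketched as a standalone illustration (this is not the actual ClickHouse implementation, just a minimal model of the behaviour described above):

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <string_view>

// Count the number of distinct byte values occurring in the string.
uint16_t stringBytesUniq(std::string_view s)
{
    std::array<bool, 256> seen{};
    uint16_t distinct = 0;
    for (unsigned char byte : s)
        if (!seen[byte])
        {
            seen[byte] = true;
            ++distinct;
        }
    return distinct;
}

// Shannon entropy (in bits per byte) of the byte distribution of the string.
// Random or encrypted data approaches 8 bits; natural text is much lower.
double stringBytesEntropy(std::string_view s)
{
    if (s.empty())
        return 0.0;
    std::array<uint32_t, 256> counters{};
    for (unsigned char byte : s)
        ++counters[byte];
    double entropy = 0.0;
    for (uint32_t count : counters)
        if (count > 0)
        {
            double p = static_cast<double>(count) / s.size();
            entropy -= p * std::log2(p);
        }
    return entropy;
}
```

For example, "Hello" has four distinct bytes (H, e, l, o) and an entropy of about 1.92 bits, since 'l' occurs twice among five bytes.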

@CLAassistant

CLAassistant commented Apr 19, 2025

CLA assistant check
All committers have signed the CLA.

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Apr 20, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Apr 20, 2025

Workflow [PR], commit [7e26ad4]

@clickhouse-gh clickhouse-gh bot added the pr-feature Pull request with new product feature label Apr 20, 2025
@alexey-milovidov
Member

At first glance, the code looks good 💪
Let's see what CI will tell...

@alexey-milovidov
Member

Logical error: 'Can't set alias of * of Asterisk'.

This is unrelated.


**Returned value**

- The number of distinct bytes in the string. [UInt8](../data-types/int-uint.md).
Member

The number of distinct bytes can range from 0 to 256, while the range of the UInt8 data type is just 0 to 255. Let's change to UInt16.
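A tiny illustration of the overflow being pointed out (a standalone sketch, not ClickHouse code): a string containing every byte value 0..255 has 256 distinct bytes, which wraps to 0 when narrowed to UInt8.

```cpp
#include <cstdint>

// The distinct-byte count ranges over 0..256 (257 possible values).
// UInt8 tops out at 255, so a count of 256 silently wraps to 0;
// UInt16 represents the full range.
constexpr uint8_t  asUInt8(unsigned n)  { return static_cast<uint8_t>(n); }
constexpr uint16_t asUInt16(unsigned n) { return static_cast<uint16_t>(n); }
```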

Contributor Author

Done.


inline UInt8 countBits(UInt64 x)
{
#if defined(__clang__)
Member

We only use clang, so there is no problem if we keep only this branch.

Member

There is std::popcount

Contributor Author
@sachinkumarsingh092 Apr 20, 2025

TIL about this in C++20. Using std::popcount instead.

private:
static constexpr size_t COUNTERS_SIZE = 256;
UInt32 counters[COUNTERS_SIZE] = {0};
UInt32 current_generation = 0;
Member

Btw, was this optimization worth it?

Contributor Author

I didn't benchmark it specifically, but it seems worth it: instead of resetting all 256 counters between strings (O(256)), it bumps a generation counter, which is an O(1) operation. For columns with many sequential strings, this approach should be significantly faster due to the lower per-string initialisation cost.
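The trick can be sketched as follows; this is an illustrative standalone version (names and the choice of putting the generation tag in the top byte of each counter are assumptions, and generation wrap-around handling is omitted):

```cpp
#include <cstdint>

// Generation-tagged counters: instead of zeroing all 256 counters between
// strings (O(256)), bump a generation id (O(1)). The top byte of each
// counter stores the generation it was last touched in; a stale counter is
// lazily re-initialised on first use. (A real implementation must also
// handle the generation id wrapping around; this sketch does not.)
struct ByteCounters
{
    static constexpr uint32_t generation_mask = 0xFF000000;

    uint32_t counters[256] = {0};
    uint32_t current_generation = 0;

    // Called once per string: O(1) instead of clearing the whole array.
    void nextString() { current_generation += 0x01000000; }

    void add(uint8_t byte)
    {
        uint32_t & counter = counters[byte];
        if ((counter & generation_mask) != current_generation)
            counter = current_generation; // stale: restart the count for this generation
        ++counter;                        // low 24 bits hold the actual count
    }

    uint32_t count(uint8_t byte) const
    {
        uint32_t counter = counters[byte];
        if ((counter & generation_mask) != current_generation)
            return 0;                     // never touched in this generation
        return counter & ~generation_mask;
    }
};
```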

UInt32 & counter = counters[byte];
if ((counter & generation_mask) != current_generation)
counter = current_generation;
counter++;
Member

A minor thing: we write prefix increment by default.

Contributor Author

Fixed.


static ResultType process(const char * str, size_t size)
{
uint64_t mask[4] = {0};
Member

It's unusual that UInt64 is used elsewhere, but uint64_t only here.

Contributor Author

Copy-pasta issue. Fixed.


for (size_t i = 0; i < input_rows_count; ++i)
{
const char * str = reinterpret_cast<const char *>(data.data() + (i == 0 ? 0 : offsets[i - 1]));
Member

I see we do reinterpret_cast twice: here to char * and there back to UInt8 *. We can remove both.

Contributor Author

Fair. I changed the processing function to take const UInt8 * to simplify things.


for (size_t i = 0; i < input_rows_count; ++i)
{
const char * str = reinterpret_cast<const char *>(data.data() + (i == 0 ? 0 : offsets[i - 1]));
Member

There is one unusual thing about offsets (PaddedPODArray): they allow referencing the -1th element, so it is safe to always write offsets[i - 1].

Contributor Author

I'd love to read how that works. For now I fixed it.

for (size_t i = 0; i < input_rows_count; ++i)
{
const char * str = reinterpret_cast<const char *>(data.data() + (i == 0 ? 0 : offsets[i - 1]));
const size_t size = offsets[i] - (i == 0 ? 0 : offsets[i - 1]) - 1;
Member

However, the idiomatic way is to keep a prev_offset variable and update it in the loop.
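The idiom in question, sketched over a ClickHouse-style flattened string column (the concatenated character data plus cumulative end offsets, each string followed by a terminating zero byte); the helper name is hypothetical:

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

// Iterate the strings of a flattened column: `data` is the concatenation of
// all strings (each followed by a trailing zero byte) and `offsets[i]` is the
// end position of row i. Keeping prev_offset avoids branching on i == 0.
std::vector<std::string_view> extractRows(const std::vector<uint8_t> & data,
                                          const std::vector<uint64_t> & offsets)
{
    std::vector<std::string_view> rows;
    uint64_t prev_offset = 0;
    for (uint64_t offset : offsets)
    {
        const char * str = reinterpret_cast<const char *>(data.data() + prev_offset);
        size_t size = offset - prev_offset - 1; // minus the trailing zero byte
        rows.emplace_back(str, size);
        prev_offset = offset;
    }
    return rows;
}
```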

Contributor Author

Done.

@@ -0,0 +1,18 @@
SELECT stringBytesUniq('Hello');
Member

At least one test with all 256 bytes.

Contributor Author

Added.

SELECT stringBytesEntropy('abcABC123');
SELECT stringBytesEntropy(toNullable('Hello'));
SELECT stringBytesEntropy(toLowCardinality('Hello'));

Member

At least one test where the string is non-constant (there is more than one string in a block).

Contributor Author

Added.

Member
@alexey-milovidov left a comment

Add a performance test as well (tests/performance).

@alexey-milovidov alexey-milovidov self-assigned this Apr 20, 2025
@@ -0,0 +1,142 @@
#include <cmath>
Member

We prefer every function in a separate file, as it opens possibilities to, e.g., link to a file from the docs, provide more precise authorship info, etc. So let's split into two files.

And then the .h file will be unneeded, so we can keep everything inside two .cpp files.

These .cpp files should be named exactly as the function, e.g., stringBytesUniq.cpp

Contributor Author
@sachinkumarsingh092 Apr 20, 2025

I understand it would be good to separate the core logic of the functions into individual .cpp files, but wouldn't we still need a common .h file to reuse the FunctionStringBytes class for processing? Wouldn't re-implementing it in each .cpp file lead to code repetition?
Sorry if I misunderstood something.

Member

Yes, you are right. Let's keep FunctionStringBytes in the header file. The .cpp files will contain the corresponding Impl, instantiate the template, and register the function.
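The agreed split can be sketched like this in one translation unit (names and signatures are hypothetical simplifications of the real ClickHouse function machinery; in the PR the template would live in the shared .h file and each Impl in its own .cpp file named after the function):

```cpp
#include <array>
#include <cstdint>
#include <string_view>

// Shared template -- would live in the common header file.
// Each function .cpp file provides an Impl and instantiates this template.
template <typename Impl>
typename Impl::ResultType applyStringBytes(std::string_view s)
{
    return Impl::process(reinterpret_cast<const uint8_t *>(s.data()), s.size());
}

// Would live in stringBytesUniq.cpp, alongside the function registration.
struct StringBytesUniqImpl
{
    using ResultType = uint16_t;

    static ResultType process(const uint8_t * data, size_t size)
    {
        std::array<bool, 256> seen{};
        ResultType distinct = 0;
        for (size_t i = 0; i < size; ++i)
            if (!seen[data[i]])
            {
                seen[data[i]] = true;
                ++distinct;
            }
        return distinct;
    }
};
```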

Contributor Author

Made this change as well.

@clickhouse-gh clickhouse-gh bot added the manual approve Manual approve required to run CI label Apr 20, 2025
@alexey-milovidov alexey-milovidov removed the manual approve Manual approve required to run CI label Apr 20, 2025
@sachinkumarsingh092
Contributor Author

sachinkumarsingh092 commented Apr 21, 2025

The integration test failures seem to be unrelated to the changes; they involve test_make_clone_covered_by_broken_detached_dir_exists, test_adding_replica, and test_limited_fetches_for_server in different integration tests.
The stateless test failures also seem to be unrelated.
I could be wrong, but these all seem to be intermittent issues. Can we rerun these failing tests?

\N
\N
\N
0
Member

This result (0) looks incorrect.

Contributor Author

Ah yes, corner case. I now use UInt16.


struct StringBytesUniqImpl
{
using ResultType = UInt8;
Member

Because the type here wasn't changed.

Contributor Author

I now use UInt16.

@clickhouse-gh clickhouse-gh bot added the manual approve Manual approve required to run CI label Apr 21, 2025
@alexey-milovidov alexey-milovidov removed the manual approve Manual approve required to run CI label Apr 21, 2025
Member
@alexey-milovidov left a comment

Looks good!

One ask - could you please compare the performance (on your machine - ok) with and without the optimization with current_generation?

You can do it on the https://github.com/ClickHouse/ClickBench dataset, on the URL, SearchPhrase, Title, MobilePhoneModel columns.

@sachinkumarsingh092
Contributor Author

Hi @alexey-milovidov, I did some benchmarking.
I hack-patched the benchmark.sh script from ClickBench, substituting run.sh inside it to form a self-contained script. This is the script I finally used:

#!/bin/bash -e

./programs/clickhouse-client < create.sql

./programs/clickhouse-client --query "INSERT INTO hits SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/hits.tsv.gz')" --time

TRIES=3
QUERY_NUM=1
cat queries.sql | while read -r query; do
    echo -n "["
    for i in $(seq 1 $TRIES); do
        (./programs/clickhouse-client --time --format=Null --query="$query" --progress 0 ${EXTRA_SETTINGS} 2>&1 |
            grep -Eo '^[0-9]+\.[0-9]+$' || echo -n "null") | tr -d '\n'

        [[ "$i" != $TRIES ]] && echo -n ", "
    done
    echo "],"

    QUERY_NUM=$((QUERY_NUM + 1))
done

./programs/clickhouse-client --query "SELECT total_bytes FROM system.tables WHERE name = 'hits' AND database = 'default'"

Here ./programs/clickhouse-client is the locally built client, create.sql creates the hits table, and queries.sql looks like this:

SELECT stringBytesUniq(URL), stringBytesEntropy(URL) FROM hits;
SELECT stringBytesUniq(SearchPhrase), stringBytesEntropy(SearchPhrase) FROM hits;
SELECT stringBytesUniq(Title), stringBytesEntropy(Title) FROM hits;
SELECT stringBytesUniq(MobilePhoneModel), stringBytesEntropy(MobilePhoneModel) FROM hits;

I then switched between the optimised and unoptimised versions by rebuilding the server. Here's the implementation I used for the unoptimised version:

static ResultType process(const UInt8 * data, size_t size)
{
    if (size == 0)
        return 0;

    std::array<UInt32, 256> counters{};
    const UInt8 * end = data + size;

    /// Count the frequency of each byte.
    for (; data < end; ++data)
        ++counters[*data];

    Float64 entropy = 0.0;
    Float64 total = size;

    for (size_t byte = 0; byte < 256; ++byte)
    {
        UInt32 count = counters[byte];
        if (count > 0)
        {
            Float64 p = static_cast<Float64>(count) / total;
            entropy -= p * std::log2(p);
        }
    }

    return entropy;
}

Also, instead of inserting the entire table, I used just 40 million rows:

SELECT count(*)
FROM hits

Query id: cdf12fe1-046d-4936-b6b5-b93760c0e2b8

   ┌──count()─┐
1. │ 40213741 │ -- 40.21 million
   └──────────┘

Having done all that, here are the results of the optimised and unoptimised versions over 3 runs:

# unoptimised version
~/.../build/benchmark >>> ./benchmark.sh                                                                        
[2.136, 2.115, 2.284],
[0.240, 0.240, 0.225],
[2.246, 2.026, 2.088],
[0.076, 0.074, 0.078],
5018654101
~/.../build/benchmark >>> ./benchmark.sh                                                                      
[2.276, 1.969, 2.866],
[0.308, 0.428, 0.275],
[3.662, 3.054, 2.660],
[0.103, 0.122, 0.082],
5018654101
~/.../build/benchmark >>> ./benchmark.sh                                                                        
[2.228, 2.071, 2.104],
[0.256, 0.236, 0.234],
[2.329, 2.192, 2.645],
[0.088, 0.082, 0.082],
5018654101
                                                       
# optimised version
~/.../build/benchmark >>> ./benchmark.sh                                                                       
[2.592, 2.198, 2.237],
[0.287, 0.237, 0.256],
[2.443, 2.304, 2.327],
[0.089, 0.080, 0.085],
5018654101
~/.../build/benchmark >>> ./benchmark.sh                                                                        
[2.333, 2.151, 2.251],
[0.277, 0.298, 0.274],
[2.410, 2.303, 2.243],
[0.084, 0.081, 0.082],
5018654101
~/.../build/benchmark >>> ./benchmark.sh                                                                       
[2.270, 2.391, 2.267],
[0.284, 0.243, 0.246],
[2.422, 2.298, 2.300],
[0.082, 0.084, 0.083],
5018654101

It seems the unoptimised version performed better! Not sure if I did something wrong in the setup.

@alexey-milovidov
Member

alexey-milovidov commented Apr 23, 2025

Yes, this was to be expected: the amount of memory to clear is quite small, while the optimization introduces branches into the code, making it more difficult for the compiler to vectorize.

Let's switch to the unoptimized version, and we can merge!

@clickhouse-gh clickhouse-gh bot added the manual approve Manual approve required to run CI label Apr 23, 2025
@sachinkumarsingh092
Contributor Author

@alexey-milovidov can we retest?

@sachinkumarsingh092
Contributor Author

Not sure if I'm looking at it right, but the failures seem unrelated?

@bharatnc
Contributor

Not sure if I'm looking at it right, but the failures seem unrelated?

Failures are unrelated. CI for the PR seems broken anyway, so I tried merging master into this branch to fix some of these unrelated failures.

@alexey-milovidov alexey-milovidov merged commit 44b7e9f into ClickHouse:master Apr 29, 2025
116 of 121 checks passed
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 29, 2025
