
feat: Add stringBytesUniq and stringBytesEntropy functions #79350

Merged
alexey-milovidov merged 12 commits into ClickHouse:master from sachinkumarsingh092:string-bytes-functions
Apr 29, 2025
Conversation

@sachinkumarsingh092
Contributor

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add stringBytesUniq and stringBytesEntropy functions to search for possibly random or encrypted data.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Fixes #79305

Changes:

  • Added the following two functions along with tests and docs for them:
    • stringBytesUniq: Counts the number of distinct bytes in a string.
    • stringBytesEntropy : Calculates Shannon's entropy of byte distribution in a string.
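The semantics of the two functions can be sketched as a standalone illustration (this is not the actual ClickHouse implementation, just a minimal model of the behaviour described above):

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <string_view>

// Count the number of distinct byte values occurring in the string.
uint16_t stringBytesUniq(std::string_view s)
{
    std::array<bool, 256> seen{};
    uint16_t distinct = 0;
    for (unsigned char byte : s)
        if (!seen[byte])
        {
            seen[byte] = true;
            ++distinct;
        }
    return distinct;
}

// Shannon entropy (in bits per byte) of the byte distribution of the string.
// Random or encrypted data approaches 8 bits; natural text is much lower.
double stringBytesEntropy(std::string_view s)
{
    if (s.empty())
        return 0.0;
    std::array<uint32_t, 256> counters{};
    for (unsigned char byte : s)
        ++counters[byte];
    double entropy = 0.0;
    for (uint32_t count : counters)
        if (count > 0)
        {
            double p = static_cast<double>(count) / s.size();
            entropy -= p * std::log2(p);
        }
    return entropy;
}
```

For example, "Hello" has four distinct bytes (H, e, l, o) and an entropy of about 1.92 bits, since 'l' occurs twice among five bytes.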

@CLAassistant

CLAassistant commented Apr 19, 2025

CLA assistant check
All committers have signed the CLA.

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Apr 20, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Apr 20, 2025

Workflow [PR], commit [7e26ad4]

@clickhouse-gh clickhouse-gh bot added the pr-feature Pull request with new product feature label Apr 20, 2025
@alexey-milovidov
Member

At first glance, the code looks good 💪
Let's see what CI will tell...

@alexey-milovidov
Member

Logical error: 'Can't set alias of * of Asterisk'.

This is unrelated.


**Returned value**

- The number of distinct bytes in the string. [UInt8](../data-types/int-uint.md).
Member

The number of distinct bytes can range from 0 to 256, while the range of the UInt8 data type is just 0 to 255. Let's change to UInt16.
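A tiny illustration of the overflow being pointed out (a standalone sketch, not ClickHouse code): a string containing every byte value 0..255 has 256 distinct bytes, which wraps to 0 when narrowed to UInt8.

```cpp
#include <cstdint>

// The distinct-byte count ranges over 0..256 (257 possible values).
// UInt8 tops out at 255, so a count of 256 silently wraps to 0;
// UInt16 represents the full range.
constexpr uint8_t  asUInt8(unsigned n)  { return static_cast<uint8_t>(n); }
constexpr uint16_t asUInt16(unsigned n) { return static_cast<uint16_t>(n); }
```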

Contributor Author

Done.


inline UInt8 countBits(UInt64 x)
{
#if defined(__clang__)
Member

We only use clang, so there is no problem if we keep only this branch.

Member

There is std::popcount

Contributor Author
@sachinkumarsingh092 Apr 20, 2025

TIL about this in C++20. Using std::popcount instead.

private:
static constexpr size_t COUNTERS_SIZE = 256;
UInt32 counters[COUNTERS_SIZE] = {0};
UInt32 current_generation = 0;
Member

Btw, was this optimization worth it?

Contributor Author

I didn't benchmark it specifically, but it seems worth it: instead of resetting all 256 counters between strings (O(256)), it bumps a generation counter, which is an O(1) operation. For columns with many sequential strings, this approach should be significantly faster due to the lower per-string initialisation cost.
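The trick can be sketched as follows; this is an illustrative standalone version (names and the choice of putting the generation tag in the top byte of each counter are assumptions, and generation wrap-around handling is omitted):

```cpp
#include <cstdint>

// Generation-tagged counters: instead of zeroing all 256 counters between
// strings (O(256)), bump a generation id (O(1)). The top byte of each
// counter stores the generation it was last touched in; a stale counter is
// lazily re-initialised on first use. (A real implementation must also
// handle the generation id wrapping around; this sketch does not.)
struct ByteCounters
{
    static constexpr uint32_t generation_mask = 0xFF000000;

    uint32_t counters[256] = {0};
    uint32_t current_generation = 0;

    // Called once per string: O(1) instead of clearing the whole array.
    void nextString() { current_generation += 0x01000000; }

    void add(uint8_t byte)
    {
        uint32_t & counter = counters[byte];
        if ((counter & generation_mask) != current_generation)
            counter = current_generation; // stale: restart the count for this generation
        ++counter;                        // low 24 bits hold the actual count
    }

    uint32_t count(uint8_t byte) const
    {
        uint32_t counter = counters[byte];
        if ((counter & generation_mask) != current_generation)
            return 0;                     // never touched in this generation
        return counter & ~generation_mask;
    }
};
```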

UInt32 & counter = counters[byte];
if ((counter & generation_mask) != current_generation)
counter = current_generation;
counter++;
Member

A minor thing: we write prefix increment by default.

Contributor Author

Fixed.


static ResultType process(const char * str, size_t size)
{
uint64_t mask[4] = {0};
Member

It's unusual that UInt64 is used elsewhere, but uint64_t only here.

Contributor Author

Copy-pasta issue. Fixed.


for (size_t i = 0; i < input_rows_count; ++i)
{
const char * str = reinterpret_cast<const char *>(data.data() + (i == 0 ? 0 : offsets[i - 1]));
Member

I see we do reinterpret_cast twice: here to char * and there back to UInt8 *. We can remove both.

Contributor Author

Fair. I changed the processing function to take const UInt8 * to simplify things.


for (size_t i = 0; i < input_rows_count; ++i)
{
const char * str = reinterpret_cast<const char *>(data.data() + (i == 0 ? 0 : offsets[i - 1]));
Member

There is one unusual thing about offsets (PaddedPODArray): they allow referencing the -1th element, so it is safe to always write offsets[i - 1].

Contributor Author

I'd love to read how that works. For now I fixed it.

for (size_t i = 0; i < input_rows_count; ++i)
{
const char * str = reinterpret_cast<const char *>(data.data() + (i == 0 ? 0 : offsets[i - 1]));
const size_t size = offsets[i] - (i == 0 ? 0 : offsets[i - 1]) - 1;
Member

However, the idiomatic way is to keep a prev_offset variable and update it in the loop.
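The idiom in question, sketched over a ClickHouse-style flattened string column (the concatenated character data plus cumulative end offsets, each string followed by a terminating zero byte); the helper name is hypothetical:

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

// Iterate the strings of a flattened column: `data` is the concatenation of
// all strings (each followed by a trailing zero byte) and `offsets[i]` is the
// end position of row i. Keeping prev_offset avoids branching on i == 0.
std::vector<std::string_view> extractRows(const std::vector<uint8_t> & data,
                                          const std::vector<uint64_t> & offsets)
{
    std::vector<std::string_view> rows;
    uint64_t prev_offset = 0;
    for (uint64_t offset : offsets)
    {
        const char * str = reinterpret_cast<const char *>(data.data() + prev_offset);
        size_t size = offset - prev_offset - 1; // minus the trailing zero byte
        rows.emplace_back(str, size);
        prev_offset = offset;
    }
    return rows;
}
```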

Contributor Author

Done.

@@ -0,0 +1,18 @@
SELECT stringBytesUniq('Hello');
Member

At least one test with all 256 bytes.

Contributor Author

Added.

SELECT stringBytesEntropy('abcABC123');
SELECT stringBytesEntropy(toNullable('Hello'));
SELECT stringBytesEntropy(toLowCardinality('Hello'));

Member

At least one test where the string is non-constant (there is more than one string in a block).

Contributor Author

Added.

Member
@alexey-milovidov left a comment

Add a performance test as well (tests/performance).

@alexey-milovidov alexey-milovidov self-assigned this Apr 20, 2025
@@ -0,0 +1,142 @@
#include <cmath>
Member

We prefer every function in a separate file, as it opens possibilities to, e.g., link to a file from the docs, provide more precise authorship info, etc. So let's split into two files.

And then the .h file will be unneeded, so we can keep everything inside two .cpp files.

These .cpp files should be named exactly as the function, e.g., stringBytesUniq.cpp

Contributor Author
@sachinkumarsingh092 Apr 20, 2025

I understand it would be good to separate the core logic of the functions into individual .cpp files, but wouldn't we still need a common .h file to reuse the FunctionStringBytes class for processing? Wouldn't re-implementing it in each .cpp file lead to code repetition?
Sorry if I misunderstood something.

Member

Yes, you are right. Let's keep FunctionStringBytes in the header file. The .cpp files will contain the corresponding Impl, instantiate the template, and register the function.
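The agreed split can be sketched like this in one translation unit (names and signatures are hypothetical simplifications of the real ClickHouse function machinery; in the PR the template would live in the shared .h file and each Impl in its own .cpp file named after the function):

```cpp
#include <array>
#include <cstdint>
#include <string_view>

// Shared template -- would live in the common header file.
// Each function .cpp file provides an Impl and instantiates this template.
template <typename Impl>
typename Impl::ResultType applyStringBytes(std::string_view s)
{
    return Impl::process(reinterpret_cast<const uint8_t *>(s.data()), s.size());
}

// Would live in stringBytesUniq.cpp, alongside the function registration.
struct StringBytesUniqImpl
{
    using ResultType = uint16_t;

    static ResultType process(const uint8_t * data, size_t size)
    {
        std::array<bool, 256> seen{};
        ResultType distinct = 0;
        for (size_t i = 0; i < size; ++i)
            if (!seen[data[i]])
            {
                seen[data[i]] = true;
                ++distinct;
            }
        return distinct;
    }
};
```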

Contributor Author

Made this change as well.

@clickhouse-gh clickhouse-gh bot added the manual approve Manual approve required to run CI label Apr 20, 2025
@alexey-milovidov alexey-milovidov removed the manual approve Manual approve required to run CI label Apr 20, 2025
@sachinkumarsingh092
Contributor Author

sachinkumarsingh092 commented Apr 21, 2025

The integration test failures seem to be unrelated to the changes; they involve test_make_clone_covered_by_broken_detached_dir_exists, test_adding_replica, and test_limited_fetches_for_server in different integration tests.
The stateless test failures also seem to be unrelated.
I could be wrong, but these all seem to be intermittent issues. Can we rerun these failing tests?

\N
\N
\N
0
Member

This result (0) looks incorrect.

Contributor Author

Ah yes, corner case. I now use UInt16.


struct StringBytesUniqImpl
{
using ResultType = UInt8;
Member

Because the type here wasn't changed.

Contributor Author

I now use UInt16.

@clickhouse-gh clickhouse-gh bot added the manual approve Manual approve required to run CI label Apr 21, 2025
@alexey-milovidov alexey-milovidov removed the manual approve Manual approve required to run CI label Apr 21, 2025
Member
@alexey-milovidov left a comment

Looks good!

One ask - could you please compare the performance (on your machine - ok) with and without the optimization with current_generation?

You can do it on the https://github.com/ClickHouse/ClickBench dataset, on the URL, SearchPhrase, Title, MobilePhoneModel columns.

@sachinkumarsingh092
Contributor Author

Hi @alexey-milovidov, I did some benchmarking.
I hack-patched the benchmark.sh script from ClickBench, substituting run.sh inside it to form a self-contained script. This is the script I finally used:

#!/bin/bash -e

./programs/clickhouse-client < create.sql

./programs/clickhouse-client --query "INSERT INTO hits SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/hits.tsv.gz')" --time

TRIES=3
QUERY_NUM=1
cat queries.sql | while read -r query; do
    echo -n "["
    for i in $(seq 1 $TRIES); do
        (./programs/clickhouse-client --time --format=Null --query="$query" --progress 0 ${EXTRA_SETTINGS} 2>&1 |
            grep -Eo '^[0-9]+\.[0-9]+$' || echo -n "null") | tr -d '\n'

        [[ "$i" != $TRIES ]] && echo -n ", "
    done
    echo "],"

    QUERY_NUM=$((QUERY_NUM + 1))
done

./programs/clickhouse-client --query "SELECT total_bytes FROM system.tables WHERE name = 'hits' AND database = 'default'"

Here ./programs/clickhouse-client is the locally built client, create.sql creates the hits table, and queries.sql looks like this:

SELECT stringBytesUniq(URL), stringBytesEntropy(URL) FROM hits;
SELECT stringBytesUniq(SearchPhrase), stringBytesEntropy(SearchPhrase) FROM hits;
SELECT stringBytesUniq(Title), stringBytesEntropy(Title) FROM hits;
SELECT stringBytesUniq(MobilePhoneModel), stringBytesEntropy(MobilePhoneModel) FROM hits;

I then switched between the optimised and unoptimised versions by rebuilding the server. Here's the implementation I used for the unoptimised version:

static ResultType process(const UInt8 * data, size_t size)
{
    if (size == 0)
        return 0;

    std::array<UInt32, 256> counters{};
    const UInt8 * end = data + size;

    /// Count the frequency of each byte.
    for (; data < end; ++data)
        ++counters[*data];

    Float64 entropy = 0.0;
    Float64 total = size;

    for (size_t byte = 0; byte < 256; ++byte)
    {
        UInt32 count = counters[byte];
        if (count > 0)
        {
            Float64 p = static_cast<Float64>(count) / total;
            entropy -= p * std::log2(p);
        }
    }

    return entropy;
}

Also, instead of inserting the entire table, I used just 40 million rows:

SELECT count(*)
FROM hits

Query id: cdf12fe1-046d-4936-b6b5-b93760c0e2b8

   ┌──count()─┐
1. │ 40213741 │ -- 40.21 million
   └──────────┘

Having done all that, here are the results of the optimised and unoptimised versions over 3 runs:

# unoptimised version
~/.../build/benchmark >>> ./benchmark.sh                                                                        
[2.136, 2.115, 2.284],
[0.240, 0.240, 0.225],
[2.246, 2.026, 2.088],
[0.076, 0.074, 0.078],
5018654101
~/.../build/benchmark >>> ./benchmark.sh                                                                      
[2.276, 1.969, 2.866],
[0.308, 0.428, 0.275],
[3.662, 3.054, 2.660],
[0.103, 0.122, 0.082],
5018654101
~/.../build/benchmark >>> ./benchmark.sh                                                                        
[2.228, 2.071, 2.104],
[0.256, 0.236, 0.234],
[2.329, 2.192, 2.645],
[0.088, 0.082, 0.082],
5018654101
                                                       
# optimised version
~/.../build/benchmark >>> ./benchmark.sh                                                                       
[2.592, 2.198, 2.237],
[0.287, 0.237, 0.256],
[2.443, 2.304, 2.327],
[0.089, 0.080, 0.085],
5018654101
~/.../build/benchmark >>> ./benchmark.sh                                                                        
[2.333, 2.151, 2.251],
[0.277, 0.298, 0.274],
[2.410, 2.303, 2.243],
[0.084, 0.081, 0.082],
5018654101
~/.../build/benchmark >>> ./benchmark.sh                                                                       
[2.270, 2.391, 2.267],
[0.284, 0.243, 0.246],
[2.422, 2.298, 2.300],
[0.082, 0.084, 0.083],
5018654101

It seems the unoptimised version performed better! Not sure if I did something wrong in the setup.

@alexey-milovidov
Member

alexey-milovidov commented Apr 23, 2025

Yes, this was to be expected: the amount of memory to clear is quite small, while the optimization introduces branches into the code, making it more difficult for the compiler to vectorize.

Let's switch to the unoptimized version, and we can merge!

@clickhouse-gh clickhouse-gh bot added the manual approve Manual approve required to run CI label Apr 23, 2025
@sachinkumarsingh092
Contributor Author

@alexey-milovidov can we retest?

@sachinkumarsingh092
Contributor Author

Not sure if I'm looking at it right, but the failures seem unrelated?

@bharatnc
Contributor

Not sure if I'm looking at it right, but the failures seem unrelated?

Failures are unrelated. CI for the PR seems broken anyway, so I tried merging master into this branch to fix some of these unrelated failures.

@alexey-milovidov alexey-milovidov merged commit 44b7e9f into ClickHouse:master Apr 29, 2025
116 of 121 checks passed
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 29, 2025
