`stringBytesUniq` and `stringBytesEntropy` functions

### Company or project name

ClickHouse

### Use case

Searching for possibly random or encrypted data.

### Describe the solution you'd like

`stringBytesUniq` returns the number of distinct bytes found in a string.
Implementation: a 256-bit mask with one bit per possible byte value.
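
A minimal sketch of the mask approach, for illustration only. The function name `countDistinctBytes` is hypothetical, and `std::bitset<256>` stands in for whatever mask representation an actual implementation would use:

```cpp
#include <bitset>
#include <cstddef>
#include <string_view>

// Hypothetical helper, not ClickHouse code.
// Counts the distinct byte values in `s` using a 256-bit mask.
inline std::size_t countDistinctBytes(std::string_view s)
{
    std::bitset<256> seen;  // one bit per possible byte value, all initially unset
    for (unsigned char c : s)
        seen.set(c);        // mark this byte value as present
    return seen.count();    // population count of the mask
}
```

The result is always between 0 (empty string) and 256.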

`stringBytesEntropy` returns the Shannon entropy, measured in bits, of the distribution of bytes in a string.
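
For reference, this quantity can be written as follows, where $c_b$ is the number of occurrences of byte value $b$ and $n$ is the string length in bytes (these symbols are notation introduced here for illustration):

$$H = -\sum_{b \,:\, c_b > 0} \frac{c_b}{n} \log_2 \frac{c_b}{n}$$

With 256 possible byte values, $H$ ranges from 0 bits (all bytes equal) to 8 bits (all 256 values equally frequent). For example, the string `aabb` gives $H = 1$ bit.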

Implementation: an array of 256 UInt32 counters.
This can be optimized to avoid clearing the array (resetting it to zero) for every string, similarly to ClearableHashTable. The highest bit of each UInt32 cell can serve as a "generation identifier" that is switched between 0 and 1 for each successive string. If we are about to use a cell whose value belongs to the previous generation, we reset its counter to zero and mark it with the new generation. I am not sure whether this optimization is worth it.
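
The generation idea could be sketched roughly as below. This is an assumption-laden illustration, not ClickHouse code: the class name `ByteEntropyCounter`, its members, and the convention of returning 0 for an empty string are all hypothetical. One detail goes beyond the description above: with a single generation bit, a cell left untouched during one string would look current again two strings later, so this sketch retags untouched cells during the final pass over the array.

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <string_view>

// Hypothetical sketch; names are illustrative.
class ByteEntropyCounter
{
public:
    // Shannon entropy, in bits, of the byte distribution of `s`.
    // Returns 0 for an empty string (an assumed convention).
    double entropy(std::string_view s)
    {
        generation ^= GENERATION_BIT;  // switch generation instead of clearing the array

        for (unsigned char c : s)
        {
            std::uint32_t & cell = cells[c];
            if ((cell & GENERATION_BIT) != generation)
                cell = generation;  // stale cell: zero the counter, tag with the new generation
            ++cell;                 // assumes fewer than 2^31 bytes, so the tag bit is never touched
        }

        double result = 0.0;
        const double total = static_cast<double>(s.size());
        for (std::uint32_t & cell : cells)
        {
            if ((cell & GENERATION_BIT) != generation)
            {
                cell = generation;  // untouched cell: retag so its old count cannot resurface later
                continue;
            }
            const double p = (cell & COUNT_MASK) / total;
            result -= p * std::log2(p);
        }
        return result;
    }

private:
    static constexpr std::uint32_t GENERATION_BIT = 0x80000000u;
    static constexpr std::uint32_t COUNT_MASK = 0x7FFFFFFFu;

    std::array<std::uint32_t, 256> cells{};
    std::uint32_t generation = 0;
};
```

Because the final pass visits all 256 cells anyway, the saving over a plain reset is limited to skipping a separate clearing pass, which is consistent with the doubt about whether the optimization pays off.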

### Describe alternatives you've considered

_No response_

### Additional context

`stringBytesEntropy` is named this way, because the elements of the measured distribution are bytes. We can add another function, `stringUTF8Entropy`, if needed.

It is named `stringBytesEntropy`, not `stringEntropyBytes`, because the latter could suggest that the returned value is measured in bytes. The returned entropy is a (fractional) number of bits.


`stringBytesUniq` and `stringBytesEntropy` functions #79305
