stringBytesUniq and stringBytesEntropy functions #79305

@alexey-milovidov

Company or project name

ClickHouse

Use case

Searching for possibly random or encrypted data.

Describe the solution you'd like

stringBytesUniq returns the number of distinct bytes found in a string.
Implementation: a 256-bit mask.
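
A minimal sketch of the bit-mask idea, written as standalone C++ rather than as a ClickHouse function. The name `stringBytesUniqSketch` is hypothetical, and `std::bitset<256>` stands in for whatever mask representation a real implementation would use:

```cpp
#include <bitset>
#include <cstddef>
#include <string_view>

// Hypothetical helper for illustration; not ClickHouse's actual code.
// Counts distinct byte values by setting one bit per byte value seen.
inline std::size_t stringBytesUniqSketch(std::string_view s)
{
    std::bitset<256> seen;  // the 256-bit mask, initially all zeros
    for (unsigned char c : s)
        seen.set(c);        // mark this byte value as present
    return seen.count();    // number of set bits = number of distinct bytes
}
```

For example, `stringBytesUniqSketch("hello")` gives 4, since the string contains only `h`, `e`, `l`, and `o`.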

stringBytesEntropy returns the Shannon entropy, measured in bits, of the distribution of bytes in a string.
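
For reference, the quantity can be written out as follows. The symbols are assumptions introduced here: $n_b$ is the number of occurrences of byte value $b$ in the string and $N$ is the string length in bytes.

```latex
H = -\sum_{b=0}^{255} p_b \log_2 p_b ,
\qquad p_b = \frac{n_b}{N},
\qquad \text{with } 0 \cdot \log_2 0 := 0 .
```

Under this definition the result lies between 0 (all bytes equal) and 8 bits (all 256 byte values equally frequent). The value for an empty string is not specified in this issue.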

Implementation: an array of 256 UInt32 counters.
This can be optimized to avoid clearing the array (resetting it to zero) for every string, similarly to ClearableHashTable. The highest bit of each UInt32 cell can serve as a "generation identifier", which is switched between 0 and 1 for each next string. If we are about to use a cell but its value belongs to the previous generation, we reset its counter to zero and mark it with the new generation. Not sure if this optimization is worth it.
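
The counter array and the generation bit could be sketched roughly as below. This is standalone C++ under several assumptions, not ClickHouse code: the class name `ByteEntropySketch` is hypothetical, an empty string is assumed to yield 0, and every string is assumed to be shorter than 2^31 bytes so that a count never spills into the generation bit. One detail goes beyond the description above. With a single generation bit, a cell that one string never touches would look current again two strings later, so this sketch retags stale cells during the final pass over all 256 cells.

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <string_view>

// Hypothetical sketch for illustration; not ClickHouse's actual code.
class ByteEntropySketch
{
public:
    // Returns the Shannon entropy (in bits) of the byte distribution of s.
    double operator()(std::string_view s)
    {
        if (s.empty())
            return 0.0;  // assumption: empty input yields 0

        generation ^= GENERATION_BIT;  // switch generation for this string

        for (unsigned char c : s)
        {
            std::uint32_t & cell = cells[c];
            if ((cell & GENERATION_BIT) != generation)
                cell = generation;  // stale cell: zero the count, tag it as current
            ++cell;                 // assumes fewer than 2^31 occurrences
        }

        const double total = static_cast<double>(s.size());
        double entropy = 0.0;
        for (std::uint32_t & cell : cells)
        {
            if ((cell & GENERATION_BIT) != generation)
            {
                // Untouched by this string: retag so it cannot be mistaken
                // for a current cell two strings from now.
                cell = generation;
                continue;
            }
            const double p = static_cast<double>(cell & COUNT_MASK) / total;
            entropy -= p * std::log2(p);
        }
        return entropy;
    }

private:
    static constexpr std::uint32_t GENERATION_BIT = 0x80000000u;
    static constexpr std::uint32_t COUNT_MASK = 0x7FFFFFFFu;

    std::array<std::uint32_t, 256> cells{};  // low 31 bits: count, top bit: generation
    std::uint32_t generation = 0;
};
```

A single `ByteEntropySketch` object would be reused across all strings of a column, so the array is never explicitly cleared between strings. Since the final pass visits all 256 cells anyway, the same pass could also just zero each counter after reading it, which may be part of why the benefit of the generation bit is uncertain.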

Describe alternatives you've considered

No response

Additional context

stringBytesEntropy is named this way, because the elements of the measured distribution are bytes. We can add another function, stringUTF8Entropy, if needed.

It is named stringBytesEntropy, not stringEntropyBytes, because the latter could suggest that the returned value is measured in bytes. The returned entropy is a (fractional) number of bits.
