stringBytesUniq and stringBytesEntropy functions #79305
Description
Company or project name
ClickHouse
Use case
Searching for possibly random or encrypted data.
Describe the solution you'd like
stringBytesUniq returns the number of distinct bytes found in a string.
Implementation: a 256-bit mask.
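A minimal sketch of the bit-mask idea, for illustration only. The function name `stringBytesUniqSketch` is hypothetical and this is not the ClickHouse implementation; it assumes a plain `std::string` input and uses `std::bitset<256>` as the 256-bit mask.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

// Hypothetical helper illustrating the proposal; not ClickHouse code.
// Each possible byte value owns one bit of a 256-bit mask.
inline std::size_t stringBytesUniqSketch(const std::string & s)
{
    std::bitset<256> seen;   // all bits start at zero
    for (unsigned char c : s)
        seen.set(c);         // mark this byte value as present
    return seen.count();     // number of set bits = number of distinct bytes
}
```

The mask occupies 32 bytes, so resetting it for every string is cheap compared with an array of counters.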
stringBytesEntropy returns the (Shannon's) entropy, measured in bits, of a distribution of bytes in a string.
Implementation: an array of 256 UInt32 counters.
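A rough sketch of the counter-based computation, under a few assumptions that the issue does not state: the name `stringBytesEntropySketch` is hypothetical, an empty string is taken to have entropy 0, and the input is short enough that no UInt32 counter overflows. The entropy is H = -Σ p_i · log2(p_i), where p_i is the fraction of bytes in the string equal to byte value i.

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <string>

// Hypothetical helper illustrating the proposal; not ClickHouse code.
inline double stringBytesEntropySketch(const std::string & s)
{
    if (s.empty())
        return 0.0;  // assumption: empty input has zero entropy

    std::array<std::uint32_t, 256> counts{};  // zero-initialized for each call
    for (unsigned char c : s)
        ++counts[c];

    const double total = static_cast<double>(s.size());
    double entropy = 0.0;
    for (std::uint32_t n : counts)
    {
        if (n == 0)
            continue;  // absent byte values contribute nothing
        const double p = n / total;
        entropy -= p * std::log2(p);
    }
    return entropy;  // in bits, between 0 and 8
}
```

A string consisting of a single repeated byte yields 0, and a string in which all 256 byte values occur equally often yields the maximum of 8 bits.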
This can be optimized to avoid clearing the array (resetting it to zero) for every string, similarly to the ClearableHashTable. The highest bit of each UInt32 cell can serve as a "generation identifier", which is switched between 0 and 1 for each successive string. If we are about to use a cell whose value belongs to the previous generation, we reset its counter to zero and mark it with the new generation. Not sure if this optimization is worth it.
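The generation idea could look roughly like the sketch below. It is a variation, not the exact scheme described above: a single tag bit cannot by itself distinguish a cell last written two strings ago from a current one, so this sketch assumes a wider tag (the upper 8 bits of each cell, leaving 24 bits for the count) and clears the whole array only when the tag wraps around. The class name `GenerationEntropySketch`, the bit split, and the limit of fewer than 2^24 bytes per string are all assumptions made for illustration.

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <string>

// Hypothetical sketch; not ClickHouse code. Assumes strings shorter than 2^24 bytes.
class GenerationEntropySketch
{
public:
    double entropy(const std::string & s)
    {
        if (s.empty())
            return 0.0;  // assumption: empty input has zero entropy

        nextGeneration();
        const std::uint32_t tag = generation << COUNT_BITS;

        for (unsigned char c : s)
        {
            std::uint32_t & cell = cells[c];
            if ((cell & TAG_MASK) != tag)
                cell = tag;  // stale cell: zero the count and stamp the current generation
            ++cell;          // the count lives in the low 24 bits
        }

        const double total = static_cast<double>(s.size());
        double result = 0.0;
        for (std::uint32_t cell : cells)
        {
            if ((cell & TAG_MASK) != tag)
                continue;  // left over from an earlier string, ignore
            const double p = (cell & COUNT_MASK) / total;
            result -= p * std::log2(p);
        }
        return result;
    }

private:
    static constexpr std::uint32_t COUNT_BITS = 24;
    static constexpr std::uint32_t COUNT_MASK = (std::uint32_t{1} << COUNT_BITS) - 1;
    static constexpr std::uint32_t TAG_MASK = ~COUNT_MASK;
    static constexpr std::uint32_t MAX_GENERATION = 255;

    void nextGeneration()
    {
        if (generation == MAX_GENERATION)
        {
            cells.fill(0);  // full clear once every 255 strings
            generation = 0;
        }
        ++generation;  // live generations are 1..255; tag 0 marks a never-used cell
    }

    std::array<std::uint32_t, 256> cells{};
    std::uint32_t generation = 0;
};
```

The summation pass still reads all 256 cells, so the saving is limited to the per-string writes that a full reset would need. Whether that is measurable would have to be benchmarked.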
Describe alternatives you've considered
No response
Additional context
stringBytesEntropy is named this way, because the elements of the measured distribution are bytes. We can add another function, stringUTF8Entropy, if needed.
It is named stringBytesEntropy, not stringEntropyBytes, because in the latter case, someone could think that the returned value is measured in the number of bytes. The returned value of the entropy is the (fractional) number of bits.