add soundex function, issue #39880#48567
Conversation
Co-authored-by: flynn <[email protected]>
Co-authored-by: flynn <[email protected]>
Co-authored-by: flynn <[email protected]>
Co-authored-by: flynn <[email protected]>
Co-authored-by: flynn <[email protected]>
Co-authored-by: flynn <[email protected]>
|
LGTM. @alexey-milovidov could you take a look at @FriendLey first pull request? |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
src/Functions/soundex.cpp
Outdated
|
|
||
| REGISTER_FUNCTION(Soundex) | ||
| { | ||
| factory.registerFunction<FunctionStringHashFixedString<SoundexImpl>>( |
There was a problem hiding this comment.
Inheriting from the base class for fixed string hashes is a bit weird, to be honest. A) Soundex isn't a hash function B) It is much more common to generate Strings (not: FixedString) from a function. Imho, it would be better to either copy-paste the few functions from FunctionStringHashFixedString to here and inherit directly from IFunction or look for better base class to inherit from (tip: check what function lower() is doing, src/Functions/lower.cpp).
There was a problem hiding this comment.
From Wikipedia, A hash function is any function that can be used to map data of arbitrary size to fixed-size values. Soundex indexing names to strings with fixed 4 lengths, it's like a type of hash function, So I directly reuse FunctionStringHashFixedString.
There was a problem hiding this comment.
Yes, if the definition is taken literally, soundex is indeed a hash function. Most people would nevertheless not consider soundex a hash function - it completely lacks the pseudo-randomness of the output which is inherent to hashing.
Producing a FixedString has a bigger problem though: FixedString is mostly used as space optimization in ClickHouse (consecutive strings need no delimiter between them). Only few functions are actually able to process them. This means that soundex in it's current form will not be composable with other functions. If we would produce data type String this problem would not exist.
There was a problem hiding this comment.
Thank you for your patient explanation. I approve, FixedString is really not a good output format.
|
If you don't mind, I did some refactorings for consistency with the overall codebase and other functions. |
|
I just added ClickHouse to the list of DBMS on Wikipedia which implement sounded 😉 |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add soundex function. Closes #39880
Documentation entry for user-facing changes