Skip to content

add soundex function, issue #39880#48567

Merged
rschu1ze merged 22 commits intoClickHouse:masterfrom
FriendLey:test_soundex
Apr 13, 2023
Merged

add soundex function, issue #39880#48567
rschu1ze merged 22 commits intoClickHouse:masterfrom
FriendLey:test_soundex

Conversation

@FriendLey
Copy link
Copy Markdown
Contributor

@FriendLey FriendLey commented Apr 9, 2023

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add soundex function. Closes #39880

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 9, 2023

CLA assistant check
All committers have signed the CLA.

@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-feature Pull request with new product feature label Apr 9, 2023
@ucasfl ucasfl added the can be tested Allows running workflows for external contributors label Apr 9, 2023
@ucasfl
Copy link
Copy Markdown
Collaborator

ucasfl commented Apr 11, 2023

LGTM. @alexey-milovidov could you take a look at @FriendLey first pull request?

@Avogar Avogar self-assigned this Apr 11, 2023
@Avogar

This comment was marked as outdated.

@ucasfl

This comment was marked as outdated.


REGISTER_FUNCTION(Soundex)
{
factory.registerFunction<FunctionStringHashFixedString<SoundexImpl>>(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inheriting from the base class for fixed string hashes is a bit weird, to be honest. A) Soundex isn't a hash function B) It is much more common to generate Strings (not: FixedString) from a function. Imho, it would be better to either copy-paste the few functions from FunctionStringHashFixedString to here and inherit directly from IFunction or look for better base class to inherit from (tip: check what function lower() is doing, src/Functions/lower.cpp).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Wikipedia, A hash function is any function that can be used to map data of arbitrary size to fixed-size values. Soundex indexing names to strings with fixed 4 lengths, it's like a type of hash function, So I directly reuse FunctionStringHashFixedString.

wiki: https://en.wikipedia.org/wiki/Hash_function

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if the definition is taken literally, soundex is indeed a hash function. Most people would nevertheless not consider soundex a hash function - it completely lacks the pseudo-randomness of the output which is inherent to hashing.

Producing a FixedString has a bigger problem though: FixedString is mostly used as space optimization in ClickHouse (consecutive strings need no delimiter between them). Only few functions are actually able to process them. This means that soundex in it's current form will not be composable with other functions. If we would produce data type String this problem would not exist.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your patient explanation. I approve, FixedString is really not a good output format.

@rschu1ze
Copy link
Copy Markdown
Member

If you don't mind, I did some refactorings for consistency with the overall codebase and other functions.

@FriendLey FriendLey requested review from Avogar, rschu1ze and ucasfl April 13, 2023 06:18
@rschu1ze rschu1ze merged commit 658d6c8 into ClickHouse:master Apr 13, 2023
@rschu1ze
Copy link
Copy Markdown
Member

I just added ClickHouse to the list of DBMS on Wikipedia which implement sounded 😉
--> https://en.wikipedia.org/wiki/Soundex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SOUNDEX function

6 participants