Skip to content

Smash string similarity#79461

Open
Pelanglene wants to merge 12 commits intoClickHouse:masterfrom
Pelanglene:smash-string-similarity
Open

Smash string similarity#79461
Pelanglene wants to merge 12 commits intoClickHouse:masterfrom
Pelanglene:smash-string-similarity

Conversation

@Pelanglene
Copy link
Copy Markdown

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Added new string similarity functions smashSimilarity and smashSimilarityUTF8 that provide intelligent word-based string comparison with gap handling and subsequence matching.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

You can read more about SMASH here:
P4104 Tang.pdf

@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Pelanglene seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@Pelanglene Pelanglene force-pushed the smash-string-similarity branch from f13a101 to 63c7c64 Compare April 23, 2025 10:24
@Pelanglene Pelanglene changed the title [draft] Smash string similarity Smash string similarity Apr 23, 2025
@vdimir vdimir added the can be tested Allows running workflows for external contributors label Apr 23, 2025
@vdimir vdimir requested a review from Copilot April 23, 2025 10:42
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces two new string similarity functions, smashSimilarity and smashSimilarityUTF8, that perform intelligent word-based comparisons with gap handling and subsequence matching. Key changes include:

  • Addition of documentation for smashSimilarity with usage examples.
  • Addition of documentation for smashSimilarityUTF8 to support multi-byte UTF-8 strings.
  • Updated examples and syntax sections in the string functions documentation.
Files not reviewed (3)
  • src/Functions/FunctionsStringDistance.cpp: Language not supported
  • tests/queries/0_stateless/03370_smash_similarity.reference: Language not supported
  • tests/queries/0_stateless/03370_smash_similarity.sql: Language not supported

@vdimir vdimir self-assigned this Apr 23, 2025
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Apr 23, 2025

Workflow [PR], commit [86433e6]

@clickhouse-gh clickhouse-gh bot added the pr-feature Pull request with new product feature label Apr 23, 2025
Copy link
Copy Markdown
Member

@vdimir vdimir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!
I've taken a brief look, but didn't read the algorithm itself carefully yet. I will try to review it more thoroughly later. Please see my initial comments for now.


-- Comparison with other string distance functions
SELECT smashSimilarity('hello world', 'hello there') AS smash;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding cases that are close to the limit or hit the limit, for example:

SELECT smashSimilarity(repeat('a', 500000), repeat('a', 500000));
SELECT smashSimilarity(repeat('a', 500000), repeat('a', 499999) || 'b');

SELECT smashSimilarity(repeat('ab', 1000), repeat('ba', 1000));

Copy link
Copy Markdown
Author

@Pelanglene Pelanglene Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an option to continue executing the tests when there is an exception in the test? I'm trying to add tests that hit the limits, I get an exception in first test and the test execution stops there.

Comment on lines +749 to +751
0.7, // gap_open
0.3, // gap_extend
1.0 // mismatch
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain exact values in commets?

Comment on lines +556 to +564
for (size_t i = 1; i <= haystack_size; ++i)
{
for (size_t j = 1; j <= needle_size; ++j)
{
bool match;
if constexpr (is_utf8)
match = haystack_codepoints[i-1] == needle_codepoints[j-1];
else
match = haystack[i-1] == needle[j-1];
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding more detailed comments for the dynamic programming matrices to explain the algorithm's approach

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check if it is clear now please

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented May 27, 2025

Dear @vdimir, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants