Pelanglene seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
f13a101 to 63c7c64
Pull Request Overview
This PR introduces two new string similarity functions, smashSimilarity and smashSimilarityUTF8, that perform intelligent word-based comparisons with gap handling and subsequence matching. Key changes include:
- Addition of documentation for smashSimilarity with usage examples.
- Addition of documentation for smashSimilarityUTF8 to support multi-byte UTF-8 strings.
- Updated examples and syntax sections in the string functions documentation.
Files not reviewed (3)
- src/Functions/FunctionsStringDistance.cpp: Language not supported
- tests/queries/0_stateless/03370_smash_similarity.reference: Language not supported
- tests/queries/0_stateless/03370_smash_similarity.sql: Language not supported
vdimir left a comment
Thanks for the contribution!
I've taken a brief look, but haven't read the algorithm itself carefully yet. I will try to review it more thoroughly later. Please see my initial comments for now.
-- Comparison with other string distance functions
SELECT smashSimilarity('hello world', 'hello there') AS smash;
Consider adding cases that are close to the limit or hit the limit, for example:
SELECT smashSimilarity(repeat('a', 500000), repeat('a', 500000));
SELECT smashSimilarity(repeat('a', 500000), repeat('a', 499999) || 'b');
SELECT smashSimilarity(repeat('ab', 1000), repeat('ba', 1000));
Is there an option to continue executing the tests when one of them throws an exception? I'm trying to add tests that hit the limits, but I get an exception in the first test and the test execution stops there.
0.7, // gap_open
0.3, // gap_extend
1.0 // mismatch
Could you please explain the exact values in comments?
for (size_t i = 1; i <= haystack_size; ++i)
{
    for (size_t j = 1; j <= needle_size; ++j)
    {
        bool match;
        if constexpr (is_utf8)
            match = haystack_codepoints[i-1] == needle_codepoints[j-1];
        else
            match = haystack[i-1] == needle[j-1];
Consider adding more detailed comments for the dynamic programming matrices to explain the algorithm's approach.
Please check whether it is clear now.
Dear @vdimir, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Added new string similarity functions smashSimilarity and smashSimilarityUTF8 that provide intelligent word-based string comparison with gap handling and subsequence matching.
Documentation entry for user-facing changes
You can read more about SMASH here:
P4104 Tang.pdf