MOD-7944: Support multi-byte char terms #5391
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #5391 +/- ##
==========================================
- Coverage 87.85% 87.85% -0.01%
==========================================
Files 196 196
Lines 35337 35409 +72
==========================================
+ Hits 31047 31108 +61
- Misses 4290 4301 +11
raz-mon
left a comment
Very nice 💪
See some comments and suggestions
kei-nan
left a comment
Added some comments, mostly on the use of setlocale
src/tokenize.c
Outdated
  return dst;
}

static char *DefaultNormalize(char *s, char *dst, size_t *len) {
What is the difference between this and rm_strdupcase()?
There are some differences:
DefaultNormalize():
- ignores control characters
- does not allocate memory for the destination
rm_strdupcase():
- does not handle control characters, because the lexer filters them
- allocates memory for the result
127.0.0.1:6379> FT.CREATE idx schema t text
OK
# This results in the term "hiworld"
127.0.0.1:6379> hset doc1 t "hi\nworld"
(integer) 1
# You need to remove the control char to find it
127.0.0.1:6379> FT.SEARCH idx "hiworld"
1) (integer) 1
2) "doc1"
3) 1) "t"
2) "hi\nworld"
# If you try to search with a control character,
# the string will be split into two terms
127.0.0.1:6379> FT.SEARCH idx "hi\nworld"
1) (integer) 0
127.0.0.1:6379> FT.explaincli idx "hi\nworld"
1) INTERSECT {
2) UNION {
3) hi
4) +hi(expanded)
5) }
6) UNION {
7) world
8) +world(expanded)
9) }
10) }
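The two behaviours contrasted in this reply can be sketched in plain C. This is only an illustration under stated assumptions, not the code in src/tokenize.c: `sketch_normalize()` and `sketch_strdupcase()` are hypothetical names, both handle ASCII only, and escape handling is left out.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the DefaultNormalize() behaviour described above:
 * lowercases ASCII, drops control characters, and writes into a buffer owned
 * by the caller (dst must hold at least *len bytes). On return *len is the
 * number of bytes written; no NUL terminator is appended. */
static char *sketch_normalize(const char *s, char *dst, size_t *len) {
  size_t out = 0;
  for (size_t i = 0; i < *len; ++i) {
    unsigned char c = (unsigned char)s[i];
    if (iscntrl(c)) continue;      /* control characters are skipped */
    dst[out++] = (char)tolower(c); /* bytes >= 0x80 pass through unchanged */
  }
  *len = out;
  return dst;
}

/* Hypothetical stand-in for the rm_strdupcase() behaviour described above:
 * allocates the result and lowercases ASCII, but does not filter control
 * characters. The caller must free() the returned string. */
static char *sketch_strdupcase(const char *s, size_t len) {
  char *ret = malloc(len + 1);
  if (!ret) return NULL;
  for (size_t i = 0; i < len; ++i) {
    ret[i] = (char)tolower((unsigned char)s[i]);
  }
  ret[len] = '\0';
  return ret;
}
```

With the input "Hi\nWorld", the first function yields the 7 bytes "hiworld", matching the indexed term in the session above, while the second keeps the newline.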
Backport failed for
Please cherry-pick the changes locally and resolve any conflicts.
git fetch origin 2.8
git worktree add -d .worktree/backport-5391-to-2.8 origin/2.8
cd .worktree/backport-5391-to-2.8
git switch --create backport-5391-to-2.8
git cherry-pick -x 43701d8e8a46a2ecc03055d2328f2ef81677f064
Backport failed for
Please cherry-pick the changes locally and resolve any conflicts.
git fetch origin 2.6
git worktree add -d .worktree/backport-5391-to-2.6 origin/2.6
cd .worktree/backport-5391-to-2.6
git switch --create backport-5391-to-2.6
git cherry-pick -x 43701d8e8a46a2ecc03055d2328f2ef81677f064
Backport failed for
Please cherry-pick the changes locally and resolve any conflicts.
git fetch origin 2.10
git worktree add -d .worktree/backport-5391-to-2.10 origin/2.10
cd .worktree/backport-5391-to-2.10
git switch --create backport-5391-to-2.10
git cherry-pick -x 43701d8e8a46a2ecc03055d2328f2ef81677f064
Successfully created backport PR for
(cherry picked from commit 43701d8)
MOD-7944: Support multi-byte char terms (#5391)

* Normalize multi-byte char terms
* Fix rm_strdupcase_utf8
* Remove debug log
* Add Russian alphabet and diacritic tests
* Don't use DefaultNormalize_utf8() for chinese
* print available locales
* Check locale before using utf8 normalization
* Fix DefaultNormalize()
* Skip multibyte tests if 'en_US.UTF-8' locale is not available
* Test multibyte stopwords
* Revert changes in install_script.sh
* run sanitizer using ubuntu:latest
* Test_cn: fix language_field
* Remove unused strtolower function from misc.c and misc.h
* Fixes from code review
* Enhance multibyte character tests
* Validate queries using multi-byte stopwords
* Add MULTIBYTE_CHARS config param and install locales
* Use setlocale instead of querylocale for better compatibility
* revert changes in event-pull_request.yml and install French locale for debian/ubuntu
* Convert to lowercase using nunicode library
* nunicode_tolower() returns zero terminated string
* Refactor nunicode_tolower() to improve readability
* Refactor nunicode_tolower() to use destination buffer and improve memory management
* Remove null termination in nunicode_tolower()
* Improve documentation
* Support multi-byte chars for tags
* Support multi-byte chars for synonyms
* unicode_tolower(): avoid duplicating encoded input
* Increase SSO_MAX_LENGTH
* Update test to use FT.DEBUG DUMP_TAGIDX instead of FT.TAGVALS
* Keep previous test_cn:testSynonym and test_cn:testMixedHighlight
* Update CMake configuration for consistent multi-byte char sorting
* Test JSON index
* Fix JSON test and remove unneeded tolower()
* Fix query_EvalSingleTagNode()
* Refactor string normalization to use case folding instead of lowercase conversion
* Refactor tests to use run_command_on_all_shards for setting DEFAULT_DIALECT
* BWC: Add function to convert string to single codepoint folded runes
* Rename rm_strdupcase() to rm_normalize()
* Add stemming test
* Refactor string normalization to use lowercase transformation instead of folding. Modify FT.SUGGET flow to be BWC
* Refactor filtering functions to use a transformation callback for rune processing. Remove dead code.
* Update testDFAFilter to use strToSingleCodepointFoldedRunes for rune processing
* Refactor string normalization to use case folding (single codepoint) instead of lowercase transformation
* Revert "Refactor string normalization to use case folding (single codepoint) instead of lowercase transformation" This reverts commit ebcf994.
* Add case sensitivity option to tag string processing functions
* Rename tag_strtofold to tag_strtolower for clarity and update related references to reflect lowercase transformation
* Simplify length checks in tag_strtolower and rm_normalize functions
* Add documentation after review

(cherry picked from commit 43701d8)
Co-authored-by: nafraf <[email protected]>
Describe the changes in the pull request
A clear and concise description of what the PR is solving, including:
Current:
DefaultNormalize() and rm_strdupcase() don't support multi-byte characters: they do not convert multi-byte characters to lowercase, so the same word written in a different case generates different terms.
Change:
We implemented unicode_tolower(), which converts multi-byte char strings to lowercase. It uses the nunicode library, which is already used in rune_util.
rm_strdupcase() was renamed to rm_normalize(), and its conversion to lowercase is now done with unicode_tolower().
To avoid breaking changes related to the suggestion dictionary, the old function strToFoldedRunes() was renamed to strToSingleCodepointFoldedRunes() and is used only for suggestions.
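To illustrate the problem and the fix described above, here is a minimal self-contained sketch. It does not use nunicode and is not the unicode_tolower() from this PR: `toy_tolower()`, `bytewise_tolower()` and `toy_utf8_tolower()` are hypothetical names, and the mapping covers only ASCII A-Z and basic Cyrillic U+0410..U+042F. It works in place only because these mappings keep the same byte length, which is not true for Unicode in general and is presumably why the real function writes to a destination buffer.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy case mapping: ASCII A-Z and Cyrillic U+0410..U+042F only.
 * A real implementation would use full Unicode tables (e.g. from nunicode). */
static uint32_t toy_tolower(uint32_t cp) {
  if (cp >= 'A' && cp <= 'Z') return cp + 0x20;
  if (cp >= 0x0410 && cp <= 0x042F) return cp + 0x20;
  return cp;
}

/* Byte-wise lowercasing: only ASCII bytes change, so multi-byte UTF-8
 * sequences keep their original case. */
static void bytewise_tolower(char *s) {
  for (; *s; ++s) {
    unsigned char c = (unsigned char)*s;
    if (c < 0x80) *s = (char)tolower(c);
  }
}

/* Code-point-wise lowercasing of 1- and 2-byte UTF-8 sequences, in place.
 * Longer sequences are left unchanged. Assumes valid UTF-8 input. */
static void toy_utf8_tolower(char *s) {
  unsigned char *p = (unsigned char *)s;
  while (*p) {
    if (*p < 0x80) {
      *p = (unsigned char)toy_tolower(*p);
      p += 1;
    } else if ((*p & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
      /* Decode the 2-byte sequence, map it, and re-encode it. */
      uint32_t cp = ((uint32_t)(*p & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
      cp = toy_tolower(cp);
      p[0] = (unsigned char)(0xC0 | (cp >> 6));
      p[1] = (unsigned char)(0x80 | (cp & 0x3F));
      p += 2;
    } else {
      /* Skip the lead byte and any continuation bytes that follow. */
      p += 1;
      while ((*p & 0xC0) == 0x80) p += 1;
    }
  }
}
```

After bytewise_tolower(), "БЕРЛИН" and "берлин" are still different byte strings and would be indexed as different terms; after toy_utf8_tolower() they compare equal.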
Outcome:
Fix queries for prefix/contains/suffix using multi-byte characters.
Multi-byte characters are supported in TEXT and TAG fields.
Multi-byte char synonyms are supported.
This PR does not include:
testStopWords() in tests/pytests/test_multibyte_char_terms.py
testDiacriticLimitation() in tests/pytests/test_multibyte_char_terms.py
testSuggestions() in tests/pytests/test_multibyte_char_terms.py
Which additional issues this PR fixes
Main objects this PR modified
Mark if applicable