MOD-7944: Support multi-byte char terms by nafraf · Pull Request #5391 · RediSearch/RediSearch

nafraf · 2024-12-21T00:55:16Z

Describe the changes in the pull request

A clear and concise description of what the PR is solving, including:

Current:
DefaultNormalize() and rm_strdupcase() don't support multi-byte characters, so lowercasing a word that contains multi-byte characters generates different terms for the same word when it is written with different case.
Change:
We implemented unicode_tolower(), which converts multi-byte character strings to lowercase. It uses the nunicode library, which is already used in rune_util.

Function rm_strdupcase() was renamed to rm_normalize(); the conversion to lowercase is done using unicode_tolower().

To avoid breaking changes related to the suggestion dictionary, the old function strToFoldedRunes() was renamed to strToSingleCodepointFoldedRunes() and it is used only for the suggestions.
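
To illustrate the problem described under "Current", here is a minimal Python sketch (not RediSearch code). It assumes the old behavior is comparable to lowercasing only the ASCII bytes of a UTF-8 string, which is what a byte-wise tolower() does in the C locale. The names ascii_only_lower and unicode_lower are hypothetical stand-ins for the old and new normalization.

```python
def ascii_only_lower(term: str) -> str:
    """Lowercase only ASCII letters; multi-byte UTF-8 sequences stay untouched.

    Assumption: this mimics a byte-wise tolower() applied to UTF-8 input.
    """
    return term.encode("utf-8").lower().decode("utf-8")


def unicode_lower(term: str) -> str:
    """Lowercase every code point, as a Unicode-aware conversion would."""
    return term.lower()


# The same Russian word written with three different casings.
words = ["ПРИВЕТ", "Привет", "привет"]

# Byte-wise lowercasing changes nothing here, so one word becomes three terms.
ascii_terms = {ascii_only_lower(w) for w in words}
print(len(ascii_terms))

# Unicode-aware lowercasing collapses all spellings into a single term.
unicode_terms = {unicode_lower(w) for w in words}
print(len(unicode_terms))
```

With byte-wise lowercasing, a query for one casing cannot find documents indexed with another, because the stored terms differ.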

Outcome:
Fix queries for prefix/contains/suffix using multi-byte characters.
Multi-byte characters are supported in TEXT and TAG fields.
Multi-byte char synonyms are supported.
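
As a rough illustration of the prefix/contains/suffix outcome, the sketch below lowercases both the indexed term and the query pattern with Unicode-aware rules before matching. The matches helper and its `*` wildcard handling are hypothetical simplifications for this example, not the RediSearch query engine.

```python
def matches(term: str, pattern: str) -> bool:
    """Match a pattern with an optional leading/trailing '*' against a term.

    Both sides are lowercased with Unicode-aware rules first, so casing of
    multi-byte characters no longer affects the result.
    """
    term = term.lower()
    pattern = pattern.lower()
    starts, ends = pattern.startswith("*"), pattern.endswith("*")
    core = pattern.strip("*")
    if starts and ends:
        return core in term           # contains: *core*
    if ends:
        return term.startswith(core)  # prefix: core*
    if starts:
        return term.endswith(core)    # suffix: *core
    return term == core               # exact match


print(matches("Привет", "ПРИ*"))   # prefix query, different casing
print(matches("Привет", "*ВЕТ"))   # suffix query
print(matches("Привет", "*РИВ*"))  # contains query
```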

This PR does not include:
- Multi-byte char stopwords
  - See testStopWords() in tests/pytests/test_multibyte_char_terms.py
- Diacritic removal during normalization
  - See testDiacriticLimitation() in tests/pytests/test_multibyte_char_terms.py
- Changes to multi-byte char suggestions
  - See testSuggestions() in tests/pytests/test_multibyte_char_terms.py
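
The diacritics limitation can be seen in a small Python sketch: lowercasing alone keeps accents, so an accented word and its unaccented form remain different terms. The function strip_diacritics shows what an additional step could look like; it is a hypothetical illustration and is not part of this PR.

```python
import unicodedata


def lower_only(term: str) -> str:
    """Lowercase without touching diacritics (the scope described by this PR)."""
    return term.lower()


def strip_diacritics(term: str) -> str:
    """Hypothetical extra step, NOT in this PR: drop combining marks after NFD."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


word = "\u00c9lan"  # "Élan" with a precomposed E-acute

# Lowercasing keeps the accent, so this term does not equal "elan".
print(lower_only(word) == "elan")

# Only an explicit diacritic-stripping step makes the two forms equal.
print(strip_diacritics(word) == "elan")
```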

Which additional issues this PR fixes

Main objects this PR modified

...

Mark if applicable

This PR introduces API changes
This PR introduces serialization changes

codecov · 2024-12-22T18:04:13Z

Codecov Report

Attention: Patch coverage is 97.18310% with 4 lines in your changes missing coverage. Please review.

Project coverage is 87.85%. Comparing base (9cb9ae8) to head (ed7af50).
Report is 3 commits behind head on master.

Files with missing lines	Patch %	Lines
src/query_parser/v2/parser.c	33.33%	2 Missing ⚠️
src/query_param.c	50.00%	1 Missing ⚠️
src/util/strconv.h	96.96%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5391      +/-   ##
==========================================
- Coverage   87.85%   87.85%   -0.01%     
==========================================
  Files         196      196              
  Lines       35337    35409      +72     
==========================================
+ Hits        31047    31108      +61     
- Misses       4290     4301      +11

raz-mon

Very nice 💪
See some comments and suggestions

src/query_parser/v1/parser.y

src/stopwords.c

src/tokenize.c

src/util/strconv.h

tests/pytests/test_multibyte_char_terms.py

src/tokenize.c

src/util/strconv.h

kei-nan

Added some comments, mostly on the use of setlocale

.github/workflows/event-pull_request.yml

src/query_parser/v1/parser.y

src/tokenize.c

DvirDukhan · 2024-12-31T10:15:07Z

src/tokenize.c

  return dst;
 }

+static char *DefaultNormalize(char *s, char *dst, size_t *len) {


what is the difference between this and rm_strdupcase?

There are some differences:

DefaultNormalize():
- ignores control characters
- does not allocate memory for the destination

rm_strdupcase():
- does not handle control characters, because the lexer filters them
- allocates memory for the result

127.0.0.1:6379> FT.CREATE idx schema t text
OK
# This results in a term "hiworld"
127.0.0.1:6379> hset doc1 t "hi\nworld"
(integer) 1
# You need to remove the control char to find it
127.0.0.1:6379> FT.SEARCH idx "hiworld"
1) (integer) 1
2) "doc1"
3) 1) "t"
   2) "hi\nworld"
# If you try to search with a control character,
# the string will be split into two terms
127.0.0.1:6379> FT.SEARCH idx "hi\nworld"
1) (integer) 0
127.0.0.1:6379> FT.explaincli idx "hi\nworld"
 1) INTERSECT {
 2)   UNION {
 3)     hi
 4)     +hi(expanded)
 5)   }
 6)   UNION {
 7)     world
 8)     +world(expanded)
 9)   }
10) }
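
The asymmetry in the session above can be modeled with a short Python sketch. It assumes, based only on the reply and the redis-cli output, that the indexing side drops control characters inside a token while the query side treats them as separators. Both function names are hypothetical and the logic is a simplification, not the actual tokenizer or lexer.

```python
import unicodedata


def is_control(ch: str) -> bool:
    """True for characters in the Unicode control category (Cc), such as \\n."""
    return unicodedata.category(ch) == "Cc"


def normalize_indexed_term(token: str) -> str:
    """Indexing side (assumed): drop control characters, then lowercase."""
    return "".join(ch for ch in token if not is_control(ch)).lower()


def split_query_terms(query: str) -> list:
    """Query side (assumed): control characters act as term separators."""
    cleaned = "".join(" " if is_control(ch) else ch for ch in query)
    return [term.lower() for term in cleaned.split()]


# The document value is indexed as one term.
print(normalize_indexed_term("hi\nworld"))

# The same text as a query becomes two terms, so it cannot match that term.
print(split_query_terms("hi\nworld"))
```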

tests/pytests/test_multibyte_char_terms.py

redisearch-backport-pull-request · 2025-02-12T17:43:17Z

Backport failed for 2.8, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.8
git worktree add -d .worktree/backport-5391-to-2.8 origin/2.8
cd .worktree/backport-5391-to-2.8
git switch --create backport-5391-to-2.8
git cherry-pick -x 43701d8e8a46a2ecc03055d2328f2ef81677f064

redisearch-backport-pull-request · 2025-02-12T17:43:19Z

Backport failed for 2.6, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.6
git worktree add -d .worktree/backport-5391-to-2.6 origin/2.6
cd .worktree/backport-5391-to-2.6
git switch --create backport-5391-to-2.6
git cherry-pick -x 43701d8e8a46a2ecc03055d2328f2ef81677f064

redisearch-backport-pull-request · 2025-02-12T17:43:20Z

Backport failed for 2.10, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.10
git worktree add -d .worktree/backport-5391-to-2.10 origin/2.10
cd .worktree/backport-5391-to-2.10
git switch --create backport-5391-to-2.10
git cherry-pick -x 43701d8e8a46a2ecc03055d2328f2ef81677f064

* Normalize multi-byte char terms
* Fix rm_strdupcase_utf8
* Remove debug log
* Add Russian alphabet and diacritic tests
* Don't use DefaultNormalize_utf8() for chinese
* print available locales
* Check locale before using utf8 normalization
* Fix DefaultNormalize()
* Skip multibyte tests if 'en_US.UTF-8' locale is not available
* Test multibyte stopwords
* Revert changes in install_script.sh
* run sanitizer using ubuntu:latest
* Test_cn: fix language_field
* Remove unused strtolower function from misc.c and misc.h
* Fixes from code review
* Enhance multibyte character tests
* Validate queries using multi-byte stopwords
* Add MULTIBYTE_CHARS config param and install locales
* Use setlocale instead of querylocale for better compatibility
* revert changes in event-pull_request.yml and install French locale for debian/ubuntu
* Convert to lowercase using nunicode library
* nunicode_tolower() returns zero terminated string
* Refactor nunicode_tolower() to improve readability
* Refactor nunicode_tolower() to use destination buffer and improve memory management
* Remove null termination in nunicode_tolower()
* Improve documentation
* Support multi-byte chars for tags
* Support multi-byte chars for synonyms
* unicode_tolower(): avoid duplicating encoded input
* Increase SSO_MAX_LENGTH
* Update test to use FT.DEBUG DUMP_TAGIDX instead of FT.TAGVALS
* Keep previous test_cn:testSynonym and test_cn:testMixedHighlight
* Update CMake configuration for consistent multi-byte char sorting
* Test JSON index
* Fix JSON test and remove unneeded tolower()
* Fix query_EvalSingleTagNode()
* Refactor string normalization to use case folding instead of lowercase conversion
* Refactor tests to use run_command_on_all_shards for setting DEFAULT_DIALECT
* BWC: Add function to convert string to single codepoint folded runes
* Rename rm_strdupcase() to rm_normalize()
* Add stemming test
* Refactor string normalization to use lowercase transformation instead of folding. Modify FT.SUGGET flow to be BWC
* Refactor filtering functions to use a transformation callback for rune processing. Remove dead code.
* Update testDFAFilter to use strToSingleCodepointFoldedRunes for rune processing
* Refactor string normalization to use case folding (single codepoint) instead of lowercase transformation
* Revert "Refactor string normalization to use case folding (single codepoint) instead of lowercase transformation" This reverts commit ebcf994.
* Add case sensitivity option to tag string processing functions
* Rename tag_strtofold to tag_strtolower for clarity and update related references to reflect lowercase transformation
* Simplify length checks in tag_strtolower and rm_normalize functions
* Add documentation after review

(cherry picked from commit 43701d8)

redisearch-backport-pull-request · 2025-02-12T17:43:23Z

Successfully created backport PR for 8.0:

[8.0] MOD-7944: Support multi-byte char terms #5642

(cherry picked from commit 43701d8)

MOD-7944: Support multi-byte char terms (#5391) (cherry picked from commit 43701d8) Co-authored-by: nafraf

(cherry picked from commit 43701d8)

MOD-7944: Support multi-byte char terms (#5391) (cherry picked from commit 43701d8)

* MOD-7944: Support multi-byte char terms (#5391) (cherry picked from commit 43701d8)
* Replace ftDebugCmdName() by debug_cmd()

* MOD-7944: Support multi-byte char terms (#5391) (cherry picked from commit 43701d8)
* Fix Query_EvalTagPrefixNode to use QN_PREFIX

nafraf added 4 commits December 20, 2024 19:53

Normalize multi-byte char terms

4a7a1a1

Fix rm_strdupcase_utf8

db1a1ee

Remove debug log

da779ec

Add Russian alphabet and diacritic tests

b1de058

nafraf changed the title from "WIP: Support multi-byte char terms" to "MOD-7944: Support multi-byte char terms" Dec 22, 2024

nafraf marked this pull request as ready for review December 22, 2024 17:41

nafraf added 11 commits December 23, 2024 11:43

Merge branch 'master' into nafraf_multibyte-char-terms

cf12970

Merge branch 'master' into nafraf_multibyte-char-terms

417f12b

Don't use DefaultNormalize_utf8() for chinese

b4b2f97

print available locales

f60b220

Check locale before using utf8 normalization

3c85883

Fix DefaultNormalize()

c41c9a7

Skip multibyte tests if 'en_US.UTF-8' locale is not available

40a010c

Test multibyte stopwords

45d790f

Revert changes in install_script.sh

17898b3

run sanitizer using ubuntu:latest

ef74b0d

Test_cn: fix language_field

1a7d33e

nafraf requested review from DvirDukhan and raz-mon December 29, 2024 09:58

raz-mon reviewed Dec 30, 2024

nafraf added 2 commits December 30, 2024 09:25

Remove unused strtolower function from misc.c and misc.h

3bcccd0

Fixes from code review

0a31b85

kei-nan reviewed Dec 31, 2024

src/tokenize.c Outdated

kei-nan reviewed Dec 31, 2024

src/tokenize.c Outdated

kei-nan reviewed Dec 31, 2024

src/util/strconv.h Outdated

kei-nan previously requested changes Dec 31, 2024

DvirDukhan reviewed Dec 31, 2024

nafraf added 2 commits December 31, 2024 15:37

Merge branch 'master' into nafraf_multibyte-char-terms

c1ed74a

Enhance multibyte character tests

14f2c32