Skip to content

MOD-7944: Support multi-byte char terms#5391

Merged
nafraf merged 60 commits intomasterfrom
nafraf_multibyte-char-terms
Feb 12, 2025
Merged

MOD-7944: Support multi-byte char terms#5391
nafraf merged 60 commits intomasterfrom
nafraf_multibyte-char-terms

Conversation

@nafraf
Copy link
Collaborator

@nafraf nafraf commented Dec 21, 2024

Describe the changes in the pull request

A clear and concise description of what the PR is solving, including:

  1. Current:
    DefaultNormalize() and rm_strdupcase() don't support multi-byte characters, and the conversion to lowercase of multi-byte characters are generating different terms for the same word with if written with different case.

  2. Change:
    We implemented unicode_tolower() which converts multi-byte char strings to lowercase case, it uses the nunicode library that is already used in rune_util.

Function rm_strdupcase() was renamed to rm_normalize(), the conversion to lowercase is done using unicode_tolower().

To avoid breaking changes related to the suggestion dictionary, the old function strToFoldedRunes() was renamed to strToSingleCodepointFoldedRunes() and it is used only for the suggestions.

  1. Outcome:
    Fix queries for prefix/contains/suffix using multi-byte characters.
    Multi-byte characters are supported in TEXT and TAG fields.
    Multi-byte char synonyms are supported.

    This PR does not include:

    • Multi-byte char stopwords
      • See testStopWords() in tests/pytests/test_multibyte_char_terms.py
    • Diacritics removing during normalization
      • See testDiacriticLimitation() in tests/pytests/test_multibyte_char_terms.py
    • Changes to multi-byte char suggestions
      • See testSuggestions() in tests/pytests/test_multibyte_char_terms.py

Which additional issues this PR fixes

  1. MOD-7944
  2. Fixes Documents having uppercase fields aren't returned when using lower case letters for TEXT field search using FT.search in Redisearch #2015
  3. Fixes [BUG] FT.SEARCH word prefix is not working with cyrillic texts - [MOD-6407] #4886

Main objects this PR modified

  1. ...

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

@nafraf nafraf changed the title WIP: Support multi-byte char terms MOD-7944: Support multi-byte char terms Dec 22, 2024
@nafraf nafraf marked this pull request as ready for review December 22, 2024 17:41
@codecov
Copy link

codecov bot commented Dec 22, 2024

Codecov Report

Attention: Patch coverage is 97.18310% with 4 lines in your changes missing coverage. Please review.

Project coverage is 87.85%. Comparing base (9cb9ae8) to head (ed7af50).
Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
src/query_parser/v2/parser.c 33.33% 2 Missing ⚠️
src/query_param.c 50.00% 1 Missing ⚠️
src/util/strconv.h 96.96% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5391      +/-   ##
==========================================
- Coverage   87.85%   87.85%   -0.01%     
==========================================
  Files         196      196              
  Lines       35337    35409      +72     
==========================================
+ Hits        31047    31108      +61     
- Misses       4290     4301      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nafraf nafraf requested review from DvirDukhan and raz-mon December 29, 2024 09:58
Copy link
Collaborator

@raz-mon raz-mon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice 💪
See some comments and suggestions

Copy link
Collaborator

@kei-nan kei-nan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments, mostly on the use of setlocale

src/tokenize.c Outdated
return dst;
}

static char *DefaultNormalize(char *s, char *dst, size_t *len) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the difference between this and rm_strdupcase?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some differences:

DefaultNormalize():

  • ignores control characters
  • does not allocate memory for the destination

rm_strdupcase():

  • does not handle control characters, because the lexer filter them.
  • allocate memory for the result
127.0.0.1:6379> FT.CREATE idx schema t text
OK
# This result in a term "hiworld"
127.0.0.1:6379> hset doc1  t "hi\nworld" 
(integer) 1
# You need to remove the control char to find it
127.0.0.1:6379> FT.SEARCH idx "hiworld"
1) (integer) 1
2) "doc1"
3) 1) "t"
   2) "hi\nworld"
# if you try to search with a control character, 
# the string will be split in two terms
127.0.0.1:6379> FT.SEARCH idx "hi\nworld"
1) (integer) 0
127.0.0.1:6379> FT.explaincli idx "hi\nworld"
 1) INTERSECT {
 2)   UNION {
 3)     hi
 4)     +hi(expanded)
 5)   }
 6)   UNION {
 7)     world
 8)     +world(expanded)
 9)   }
10) }

@nafraf nafraf added this pull request to the merge queue Feb 12, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 12, 2025
@nafraf nafraf added this pull request to the merge queue Feb 12, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 12, 2025
@nafraf nafraf added this pull request to the merge queue Feb 12, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 12, 2025
@nafraf nafraf added this pull request to the merge queue Feb 12, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 12, 2025
@nafraf nafraf added this pull request to the merge queue Feb 12, 2025
Merged via the queue into master with commit 43701d8 Feb 12, 2025
18 of 44 checks passed
@nafraf nafraf deleted the nafraf_multibyte-char-terms branch February 12, 2025 17:43
@redisearch-backport-pull-request
Copy link
Contributor

Backport failed for 2.8, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.8
git worktree add -d .worktree/backport-5391-to-2.8 origin/2.8
cd .worktree/backport-5391-to-2.8
git switch --create backport-5391-to-2.8
git cherry-pick -x 43701d8e8a46a2ecc03055d2328f2ef81677f064

@redisearch-backport-pull-request
Copy link
Contributor

Backport failed for 2.6, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.6
git worktree add -d .worktree/backport-5391-to-2.6 origin/2.6
cd .worktree/backport-5391-to-2.6
git switch --create backport-5391-to-2.6
git cherry-pick -x 43701d8e8a46a2ecc03055d2328f2ef81677f064

@redisearch-backport-pull-request
Copy link
Contributor

Backport failed for 2.10, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.10
git worktree add -d .worktree/backport-5391-to-2.10 origin/2.10
cd .worktree/backport-5391-to-2.10
git switch --create backport-5391-to-2.10
git cherry-pick -x 43701d8e8a46a2ecc03055d2328f2ef81677f064

redisearch-backport-pull-request bot pushed a commit that referenced this pull request Feb 12, 2025
* Normalize multi-byte char terms

* Fix rm_strdupcase_utf8

* Remove debug log

* Add Russian alphabet and diacritic tests

* Don't use DefaultNormalize_utf8() for chinese

* print available locales

* Check locale before using utf8 normalization

* Fix DefaultNormalize()

* Skip multibyte tests if 'en_US.UTF-8' locale is not available

* Test multibyte stopwords

* Revert changes in install_script.sh

* run sanitizer using ubuntu:latest

* Test_cn: fix language_field

* Remove unused strtolower function from misc.c and misc.h

* Fixes from code review

* Enhance multibyte character tests

* Validate queries using multi-byte stopwords

* Add MULTIBYTE_CHARS config param and install locales

* Use setlocale instead of querylocale for better compatibility

* revert changes in event-pull_request.yml and install French locale for debian/ubuntu

* Convert to lowercase using nunicode library

* nunicode_tolower() returns zero terminated string

* Refactor nunicode_tolower() to improve readability

* Refactor nunicode_tolower() to use destination buffer and improve memory management

* Remove null termination in nunicode_tolower()

* Improve documentation

* Support multi-byte chars for tags

* Support multi-byte chars for synonyms

* unicode_tolower(): avoid duplicating encoded input

* Increase SSO_MAX_LENGTH

* Update test to  use FT.DEBUG DUMP_TAGIDX instead of FT.TAGVALS

* Keep previous test_cn:testSynonym and test_cn:testMixedHighlight

* Update CMake configuration for consistent multi-byte char sorting

* Test JSON index

* Fix JSON test and remove unneeded tolower()

* Fix query_EvalSingleTagNode()

* Refactor string normalization to use case folding instead of lowercase conversion

* Refactor tests to use run_command_on_all_shards for setting DEFAULT_DIALECT

* BWC: Add function to convert string to single codepoint folded runes

* Rename rm_strdupcase() to rm_normalize()

* Add stemming test

* Refactor string normalization to use lowercase transformation instead of folding.
Modify FT.SUGGET flow to be BWC

* Refactor filtering functions to use a transformation callback for rune processing. Remove dead code.

* Update testDFAFilter to use strToSingleCodepointFoldedRunes for rune processing

* Refactor string normalization to use case folding (single codepoint) instead of lowercase transformation

* Revert "Refactor string normalization to use case folding (single codepoint) instead of lowercase transformation"

This reverts commit ebcf994.

* Add case sensitivity option to tag string processing functions

* Rename tag_strtofold to tag_strtolower for clarity and update related references to reflect lowercase transformation

* Simplify length checks in tag_strtolower and rm_normalize functions

* Add documentation after review

(cherry picked from commit 43701d8)
@redisearch-backport-pull-request
Copy link
Contributor

Successfully created backport PR for 8.0:

nafraf added a commit that referenced this pull request Feb 12, 2025
github-merge-queue bot pushed a commit that referenced this pull request Feb 12, 2025
MOD-7944: Support multi-byte char terms (#5391)

* Normalize multi-byte char terms

* Fix rm_strdupcase_utf8

* Remove debug log

* Add Russian alphabet and diacritic tests

* Don't use DefaultNormalize_utf8() for chinese

* print available locales

* Check locale before using utf8 normalization

* Fix DefaultNormalize()

* Skip multibyte tests if 'en_US.UTF-8' locale is not available

* Test multibyte stopwords

* Revert changes in install_script.sh

* run sanitizer using ubuntu:latest

* Test_cn: fix language_field

* Remove unused strtolower function from misc.c and misc.h

* Fixes from code review

* Enhance multibyte character tests

* Validate queries using multi-byte stopwords

* Add MULTIBYTE_CHARS config param and install locales

* Use setlocale instead of querylocale for better compatibility

* revert changes in event-pull_request.yml and install French locale for debian/ubuntu

* Convert to lowercase using nunicode library

* nunicode_tolower() returns zero terminated string

* Refactor nunicode_tolower() to improve readability

* Refactor nunicode_tolower() to use destination buffer and improve memory management

* Remove null termination in nunicode_tolower()

* Improve documentation

* Support multi-byte chars for tags

* Support multi-byte chars for synonyms

* unicode_tolower(): avoid duplicating encoded input

* Increase SSO_MAX_LENGTH

* Update test to  use FT.DEBUG DUMP_TAGIDX instead of FT.TAGVALS

* Keep previous test_cn:testSynonym and test_cn:testMixedHighlight

* Update CMake configuration for consistent multi-byte char sorting

* Test JSON index

* Fix JSON test and remove unneeded tolower()

* Fix query_EvalSingleTagNode()

* Refactor string normalization to use case folding instead of lowercase conversion

* Refactor tests to use run_command_on_all_shards for setting DEFAULT_DIALECT

* BWC: Add function to convert string to single codepoint folded runes

* Rename rm_strdupcase() to rm_normalize()

* Add stemming test

* Refactor string normalization to use lowercase transformation instead of folding.
Modify FT.SUGGET flow to be BWC

* Refactor filtering functions to use a transformation callback for rune processing. Remove dead code.

* Update testDFAFilter to use strToSingleCodepointFoldedRunes for rune processing

* Refactor string normalization to use case folding (single codepoint) instead of lowercase transformation

* Revert "Refactor string normalization to use case folding (single codepoint) instead of lowercase transformation"

This reverts commit ebcf994.

* Add case sensitivity option to tag string processing functions

* Rename tag_strtofold to tag_strtolower for clarity and update related references to reflect lowercase transformation

* Simplify length checks in tag_strtolower and rm_normalize functions

* Add documentation after review

(cherry picked from commit 43701d8)

Co-authored-by: nafraf <[email protected]>
nafraf added a commit that referenced this pull request Feb 12, 2025
nafraf added a commit that referenced this pull request Feb 13, 2025
github-merge-queue bot pushed a commit that referenced this pull request Feb 17, 2025
MOD-7944: Support multi-byte char terms (#5391)

(cherry picked from commit 43701d8)
github-merge-queue bot pushed a commit that referenced this pull request Feb 18, 2025
* MOD-7944: Support multi-byte char terms (#5391)

(cherry picked from commit 43701d8)

* Replace ftDebugCmdName() by debug_cmd()
github-merge-queue bot pushed a commit that referenced this pull request Feb 18, 2025
* MOD-7944: Support multi-byte char terms (#5391)

(cherry picked from commit 43701d8)

* Fix Query_EvalTagPrefixNode to use QN_PREFIX
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

5 participants