[8.0] MOD-7944: Support multi-byte char terms #5642

Merged
redisearch-backport-pull-request[bot] merged 1 commit into 8.0 from backport-5391-to-8.0
Feb 12, 2025

@redisearch-backport-pull-request
Contributor

Description

Backport of #5391 to 8.0.

* Normalize multi-byte char terms

* Fix rm_strdupcase_utf8

* Remove debug log

* Add Russian alphabet and diacritic tests

* Don't use DefaultNormalize_utf8() for Chinese

* print available locales

* Check locale before using utf8 normalization

* Fix DefaultNormalize()

* Skip multibyte tests if 'en_US.UTF-8' locale is not available

* Test multibyte stopwords

* Revert changes in install_script.sh

* run sanitizer using ubuntu:latest

* Test_cn: fix language_field

* Remove unused strtolower function from misc.c and misc.h

* Fixes from code review

* Enhance multibyte character tests

* Validate queries using multi-byte stopwords

* Add MULTIBYTE_CHARS config param and install locales

* Use setlocale instead of querylocale for better compatibility

* revert changes in event-pull_request.yml and install French locale for debian/ubuntu

* Convert to lowercase using nunicode library

* nunicode_tolower() returns zero terminated string

* Refactor nunicode_tolower() to improve readability

* Refactor nunicode_tolower() to use destination buffer and improve memory management

* Remove null termination in nunicode_tolower()

* Improve documentation

* Support multi-byte chars for tags

* Support multi-byte chars for synonyms

* unicode_tolower(): avoid duplicating encoded input

* Increase SSO_MAX_LENGTH

* Update test to use FT.DEBUG DUMP_TAGIDX instead of FT.TAGVALS

* Keep previous test_cn:testSynonym and test_cn:testMixedHighlight

* Update CMake configuration for consistent multi-byte char sorting

* Test JSON index

* Fix JSON test and remove unneeded tolower()

* Fix query_EvalSingleTagNode()

* Refactor string normalization to use case folding instead of lowercase conversion

* Refactor tests to use run_command_on_all_shards for setting DEFAULT_DIALECT

* BWC: Add function to convert string to single codepoint folded runes

* Rename rm_strdupcase() to rm_normalize()

* Add stemming test

* Refactor string normalization to use lowercase transformation instead of folding. Modify FT.SUGGET flow to be BWC

* Refactor filtering functions to use a transformation callback for rune processing. Remove dead code.

* Update testDFAFilter to use strToSingleCodepointFoldedRunes for rune processing

* Refactor string normalization to use case folding (single codepoint) instead of lowercase transformation

* Revert "Refactor string normalization to use case folding (single codepoint) instead of lowercase transformation"

This reverts commit ebcf994.

* Add case sensitivity option to tag string processing functions

* Rename tag_strtofold to tag_strtolower for clarity and update related references to reflect lowercase transformation

* Simplify length checks in tag_strtolower and rm_normalize functions

* Add documentation after review

(cherry picked from commit 43701d8)
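
The commits above move from byte-oriented lowercasing to Unicode-aware normalization, and switch more than once between lowercasing and case folding. The sketch below illustrates both distinctions using only Python's built-in `str` methods. It is not RediSearch's C implementation (which, per the commit list, uses the nunicode library), and the helper name `ascii_lower_bytes` is hypothetical.

```python
def ascii_lower_bytes(data: bytes) -> bytes:
    """Lowercase ASCII A-Z one byte at a time (hypothetical helper).

    Bytes outside 0x41-0x5A, including every byte of a multi-byte
    UTF-8 sequence, pass through unchanged.
    """
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in data)


term = "ПРИВЕТ"  # Russian, two bytes per letter in UTF-8

# Byte-wise ASCII lowercasing leaves the Cyrillic term untouched.
ascii_result = ascii_lower_bytes(term.encode("utf-8")).decode("utf-8")

# Unicode-aware lowercasing maps each letter to its lowercase form.
unicode_result = term.lower()

# Lowercasing and full case folding can disagree: Python's casefold()
# expands the German sharp s, while lower() keeps it.
lowered = "Straße".lower()
folded = "Straße".casefold()

print(ascii_result, unicode_result, lowered, folded)
```

Python's `casefold()` performs full case folding, which can change the number of code points. The "single codepoint" folding named in the commits appears to be a one-to-one variant and is not reproduced here.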
@codecov

codecov bot commented Feb 12, 2025

Codecov Report

Attention: Patch coverage is 97.18310% with 4 lines in your changes missing coverage. Please review.

Project coverage is 87.85%. Comparing base (3d4b51e) to head (8510c50).
Report is 2 commits behind head on 8.0.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/query_parser/v2/parser.c | 33.33% | 2 Missing ⚠️ |
| src/query_param.c | 50.00% | 1 Missing ⚠️ |
| src/util/strconv.h | 96.96% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##              8.0    #5642      +/-   ##
==========================================
+ Coverage   87.83%   87.85%   +0.02%     
==========================================
  Files         196      196              
  Lines       35306    35378      +72     
==========================================
+ Hits        31011    31083      +72     
  Misses       4295     4295              


@redisearch-backport-pull-request redisearch-backport-pull-request bot added this pull request to the merge queue Feb 12, 2025
Merged via the queue into 8.0 with commit d990dbe Feb 12, 2025
9 checks passed
@redisearch-backport-pull-request redisearch-backport-pull-request bot deleted the backport-5391-to-8.0 branch February 12, 2025 20:35