MOD-9835: Fix unicode_tolower() by dor-forer · Pull Request #6211 · RediSearch/RediSearch

dor-forer · 2025-05-26T10:20:09Z

Describe the changes in the pull request

A clear and concise description of what the PR is solving, including:

Current:
In unicode_tolower(), when a character's lowercase form occupies more bytes than its uppercase form, the lowercased term is longer than the original term. The for loop that traverses the UTF-8 string then runs past the limits of the allocated memory.
In tag_strtolower() and rm_normalize(), the length passed to unicode_tolower() is the length of the original term, but it should be the length of the term after unescaping it.
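
Both problems can be reproduced outside RediSearch. The sketch below is Python, not the C of the patch. It assumes Python's built-in `str.lower()` as a stand-in for the case mapping, and `unescape` is a hypothetical helper that only strips escaping backslashes. Neither is RediSearch code.

```python
def utf8_len(text: str) -> int:
    """Number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def unescape(term: str) -> str:
    """Hypothetical stand-in for query unescaping: drop escaping backslashes."""
    out = []
    chars = iter(term)
    for ch in chars:
        if ch == "\\":
            # Keep the escaped character, drop the backslash itself.
            ch = next(chars, "")
        out.append(ch)
    return "".join(out)


# Problem 1: the lowercase form can need more bytes than the uppercase form.
upper = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
lower = upper.lower()
grew_by = utf8_len(lower) - utf8_len(upper)

# Problem 2: after unescaping, the term is shorter than the original,
# so the original length no longer describes the buffer contents.
escaped = "HELLO\\-WORLD"
unescaped = unescape(escaped)

print(utf8_len(upper), utf8_len(lower), grew_by)  # 2 3 1
print(len(escaped), len(unescaped))               # 12 11
```

In the C code the second case matters because the extra bytes counted by the original length are past the end of the unescaped term.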
Change:

In unicode_tolower():
- Replace the for loop with a while loop that checks that the current UTF-8 symbol is within the valid limits.
- Break out of the loop if a NULL character is found.
In tag_strtolower() and rm_normalize():
- Pass to unicode_tolower() the updated length, taken after the term is unescaped.
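
A rough model of the loop change, sketched in Python rather than C. `lower_bounded` and its `capacity` parameter are names made up for this illustration, and raising an exception on overflow is a choice of this sketch only. The real handling is in src/util/strconv.h and is not reproduced here.

```python
def lower_bounded(src: bytes, length: int, capacity: int) -> bytes:
    """Lowercase the first `length` bytes of UTF-8 `src`, never producing
    more than `capacity` output bytes and stopping at the first NUL."""
    text = src[:length].decode("utf-8")
    out = bytearray()
    i = 0
    # A while loop with explicit checks instead of a fixed-count for loop.
    while i < len(text):
        ch = text[i]
        if ch == "\0":
            break  # an embedded NUL ends the term early
        encoded = ch.lower().encode("utf-8")
        if len(out) + len(encoded) > capacity:
            raise ValueError("lowercase form does not fit in the buffer")
        out += encoded
        i += 1
    return bytes(out)


term = "\u0130A".encode("utf-8")  # 3 bytes before lowercasing
print(len(lower_bounded(term, len(term), capacity=4)))  # 4
```

With `capacity=3` the same call raises `ValueError`, which is the situation the unchecked for loop could not detect.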

Outcome:

Which additional issues this PR fixes

MOD-9835

Main objects this PR modified

...

Mark if applicable

This PR introduces API changes
This PR introduces serialization changes

Copilot

Pull Request Overview

This PR fixes the issue where unicode_tolower() would write beyond the allocated memory when the lowercase version of a multibyte term is longer than its original form. It also adjusts tag_strtolower() and rm_normalize() to pass the updated term length (after unescaping) to unicode_tolower().

- Replaced the for loop in unicode_tolower() with a while loop to ensure boundary checks.
- Updated tag_strtolower() and rm_normalize() to use the correct length after unescaping.
- Added new tests in tests/pytests/test_multibyte_char_terms.py to validate the multibyte lowercase conversion behavior.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/pytests/test_multibyte_char_terms.py | Added tests to validate proper handling of multibyte lowercase conversion. |
| src/util/strconv.h | Modified unicode_tolower() to use a while loop and updated rm_normalize(). |
| src/query.c | Fixed tag_strtolower() to use the current length after unescaping. |

Copilot

Pull Request Overview

This PR ensures unicode_tolower() does not overrun buffers when lowercase expansions grow and correctly passes the post-unescape string length to it in both tag_strtolower and rm_normalize.

- Switches the UTF-8 iteration in unicode_tolower from a simple for to a boundary-checked while and handles early NULL codepoints
- Updates tag_strtolower and rm_normalize to use the actual length of the unescaped string when calling unicode_tolower
- Adds end-to-end pytest coverage for multibyte terms whose lowercase form is larger than the original

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/pytests/test_multibyte_char_terms.py | New tests for long multibyte TAG and TEXT terms |
| src/util/strconv.h | Revamped `unicode_tolower`, assertion, malloc logic, updated normalize |
| src/query.c | Pass updated `*len` to `unicode_tolower` in `tag_strtolower` |

meiravgri

great!! some small comments and don't listen to copilot

meiravgri · 2025-05-29T15:35:18Z

src/query.c

-    size_t newLen = unicode_tolower(origStr, *len);
+    size_t newLen = unicode_tolower(origStr, length);
    if (newLen) {
      origStr[newLen] = '\0';


Just a small thought — I think this part might be a bit sensitive down the line when reallocation is added. If the buffer is resized to exactly newLen, then writing origStr[newLen] = '\0'; would write one byte past the end. Would be easy to miss, and could cause a subtle overflow 😅.

Maybe worth adding a test to catch this kind of edge case early? The Turkish 'İ' (U+0130) is a good one — it’s 2 bytes uppercase and becomes 3 bytes lowercase. 😊
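
The off-by-one raised in this comment can be modeled in a few lines. This is a Python sketch in which a `bytearray` stands in for the C buffer. `lowercase_with_terminator` is a hypothetical name, and Python raises `IndexError` where C would silently write out of bounds.

```python
def lowercase_with_terminator(term: str) -> bytearray:
    """Return the lowercased UTF-8 bytes of `term` followed by a NUL byte."""
    lowered = term.lower().encode("utf-8")
    new_len = len(lowered)
    # Allocate new_len + 1 so that index new_len is a valid slot for the NUL.
    buf = bytearray(new_len + 1)
    buf[:new_len] = lowered
    buf[new_len] = 0
    return buf


# A buffer sized to exactly new_len has no room for the terminator.
too_small = bytearray(len("\u0130".lower().encode("utf-8")))
try:
    too_small[len(too_small)] = 0
except IndexError:
    print("no room for the terminator")

print(len(lowercase_with_terminator("\u0130")))  # 4
```

For 'İ' the lowercased bytes take 3 slots, so a buffer that also holds the terminator needs 4.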

OK. You are right. I added more tests with lengths around the limit which will require memory reallocation.
Please check if it is enough.

meiravgri · 2025-05-30T04:39:20Z

tests/pytests/test_multibyte_char_terms.py

+                'FT.SEARCH', 'idx', f'@t:{{"{t_lower}"}}', 'NOCONTENT', 'DIALECT', dialect)
+            env.assertEqual(res, [1, '{doc}:2'])
+
+        # Test the edge cases where the length is around the limit to require


By “reallocation” I meant cases where the lowercase output is longer than the input — a single-character string like İ would cover that.
Is there a reason to use the more complex loop if a simpler test would make the case clearer? (Especially since the previous test already covers sequences that exceed the stack allocation size.)

* Bigger buffer and tests
* Fixes
* Fix tag_strtolower to fix test with escaped TAG
* Fix typo
* Fix rm_normalize() and test dialects
* Update src/util/strconv.h
  Co-authored-by: meiravgri <[email protected]>
* Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation
* Fix assertion
* Fix typo
  Co-authored-by: Copilot <[email protected]>
* Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior
* Add more tests
* Add test comments
* Simplify tests
* Fix typo
* Test with a single multi-byte char
* Add cpp test for unicode_tolower()
* Fix testSpecialUnicodeCase
* update u_buffer allocation for better maintainability
* Add tests lower UTF8 byes exceeding max size
* Fix typo
---------
Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit eb11bfd)

redisearch-backport-pull-request · 2025-06-05T13:36:49Z

Successfully created backport PR for 2.8:

[2.8] MOD-9835: Fix unicode_tolower() #6263

redisearch-backport-pull-request · 2025-06-05T13:36:53Z

Successfully created backport PR for 2.6:

[2.6] MOD-9835: Fix unicode_tolower() #6264

redisearch-backport-pull-request · 2025-06-05T13:36:56Z

Successfully created backport PR for 2.10:

[2.10] MOD-9835: Fix unicode_tolower() #6265

redisearch-backport-pull-request · 2025-06-05T13:37:00Z

Successfully created backport PR for 8.0:

[8.0] MOD-9835: Fix unicode_tolower() #6266

redisearch-backport-pull-request · 2025-06-05T13:37:03Z

Successfully created backport PR for 8.2:

[8.2] MOD-9835: Fix unicode_tolower() #6267

MOD-9835: Fix unicode_tolower() (#6211)
* Bigger buffer and tests
* Fixes
* Fix tag_strtolower to fix test with escaped TAG
* Fix typo
* Fix rm_normalize() and test dialects
* Update src/util/strconv.h
* Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation
* Fix assertion
* Fix typo
* Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior
* Add more tests
* Add test comments
* Simplify tests
* Fix typo
* Test with a single multi-byte char
* Add cpp test for unicode_tolower()
* Fix testSpecialUnicodeCase
* update u_buffer allocation for better maintainability
* Add tests lower UTF8 byes exceeding max size
* Fix typo
---------
(cherry picked from commit eb11bfd)
Co-authored-by: dor-forer <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>

* MOD-9835: Fix unicode_tolower() (#6211)
* Bigger buffer and tests
* Fixes
* Fix tag_strtolower to fix test with escaped TAG
* Fix typo
* Fix rm_normalize() and test dialects
* Update src/util/strconv.h
  Co-authored-by: meiravgri <[email protected]>
* Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation
* Fix assertion
* Fix typo
  Co-authored-by: Copilot <[email protected]>
* Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior
* Add more tests
* Add test comments
* Simplify tests
* Fix typo
* Test with a single multi-byte char
* Add cpp test for unicode_tolower()
* Fix testSpecialUnicodeCase
* update u_buffer allocation for better maintainability
* Add tests lower UTF8 byes exceeding max size
* Fix typo
---------
Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit eb11bfd)
* Add missing include for rmutil/alloc.h in unicode_tolower tests
---------
Co-authored-by: dor-forer <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>

* MOD-9835: Fix unicode_tolower() (#6211)
* Bigger buffer and tests
* Fixes
* Fix tag_strtolower to fix test with escaped TAG
* Fix typo
* Fix rm_normalize() and test dialects
* Update src/util/strconv.h
  Co-authored-by: meiravgri <[email protected]>
* Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation
* Fix assertion
* Fix typo
  Co-authored-by: Copilot <[email protected]>
* Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior
* Add more tests
* Add test comments
* Simplify tests
* Fix typo
* Test with a single multi-byte char
* Add cpp test for unicode_tolower()
* Fix testSpecialUnicodeCase
* update u_buffer allocation for better maintainability
* Add tests lower UTF8 byes exceeding max size
* Fix typo
---------
Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit eb11bfd)
* Add missing include for rmutil/alloc.h in unicode_tolower tests
* Update test_utf8_lowercase_longer_than_uppercase_tags: remote tag autoescape test
* Move include rmutil/alloc.c to strconv.h
---------
Co-authored-by: dor-forer <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>

* MOD-9835: Fix unicode_tolower() (#6211)
* Bigger buffer and tests
* Fixes
* Fix tag_strtolower to fix test with escaped TAG
* Fix typo
* Fix rm_normalize() and test dialects
* Update src/util/strconv.h
  Co-authored-by: meiravgri <[email protected]>
* Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation
* Fix assertion
* Fix typo
  Co-authored-by: Copilot <[email protected]>
* Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior
* Add more tests
* Add test comments
* Simplify tests
* Fix typo
* Test with a single multi-byte char
* Add cpp test for unicode_tolower()
* Fix testSpecialUnicodeCase
* update u_buffer allocation for better maintainability
* Add tests lower UTF8 byes exceeding max size
* Fix typo
---------
Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit eb11bfd)
* Add missing include for rmutil/alloc.h in strconv.h
* Fix test, remove dialect 4
* Update test_utf8_lowercase_longer_than_uppercase_tags: remote tag autoescape test
(cherry picked from commit 66e4942)
---------
Co-authored-by: dor-forer <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>

Bigger buffer and tests

1bd11ad

github-actions bot added the size:S label May 26, 2025

dor-forer and others added 2 commits May 27, 2025 08:44

Fixes

e6123af

Fix tag_strtolower to fix test with escaped TAG

4f64ea6

github-actions bot added the size:M label May 27, 2025

Fix typo

cf04236

nafraf changed the title ~~Bigger buffer and tests~~ MOD-9835: Fix unicode_tolower() May 27, 2025

meiravgri reviewed May 28, 2025

meiravgri reviewed May 28, 2025

meiravgri reviewed May 28, 2025

nafraf and others added 3 commits May 28, 2025 14:48

Fix rm_normalize() and test dialects

6cace4d

Update src/util/strconv.h

e431b3c

Co-authored-by: meiravgri <[email protected]>

Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation

99904cb

nafraf requested review from Copilot and meiravgri May 28, 2025 20:08

Copilot AI reviewed May 28, 2025

Fix assertion

e8054b1

nafraf requested a review from Copilot May 28, 2025 20:45

Copilot AI reviewed May 28, 2025

Fix typo

5c15b0b

Co-authored-by: Copilot <[email protected]>

meiravgri reviewed May 29, 2025

Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior

3b7ff87

meiravgri reviewed May 29, 2025

Add more tests

0889568

nafraf requested a review from meiravgri May 29, 2025 20:03

Add test comments

c7200df

meiravgri reviewed May 30, 2025

nafraf added 3 commits June 2, 2025 09:43

Simplify tests

0758420

Fix typo

0f93b08

Test with a single multi-byte char

3b5ee4b

oshadmi added backport 2.8 backport 2.6 backport 2.10 backport 8.0 backport 8.2 labels Jun 5, 2025

nafraf added this pull request to the merge queue Jun 5, 2025

Merged via the queue into master with commit eb11bfd Jun 5, 2025
24 checks passed

nafraf deleted the dorav-fix-buffer-size-tolower branch June 5, 2025 13:36

redisearch-backport-pull-request bot mentioned this pull request Jun 5, 2025

[2.8] MOD-9835: Fix unicode_tolower() #6263

Merged

redisearch-backport-pull-request bot mentioned this pull request Jun 5, 2025

[2.6] MOD-9835: Fix unicode_tolower() #6264

Merged

redisearch-backport-pull-request bot mentioned this pull request Jun 5, 2025

[2.10] MOD-9835: Fix unicode_tolower() #6265

Merged

redisearch-backport-pull-request bot mentioned this pull request Jun 5, 2025

[8.0] MOD-9835: Fix unicode_tolower() #6266

Merged

redisearch-backport-pull-request bot mentioned this pull request Jun 5, 2025

[8.2] MOD-9835: Fix unicode_tolower() #6267

Merged

oshadmi mentioned this pull request Jun 13, 2025

MOD-8799: Support special utf8 #5637

Merged

2 tasks
