Skip to content

MOD-9835: Fix unicode_tolower()#6211

Merged
nafraf merged 21 commits intomasterfrom
dorav-fix-buffer-size-tolower
Jun 5, 2025
Merged

MOD-9835: Fix unicode_tolower()#6211
nafraf merged 21 commits intomasterfrom
dorav-fix-buffer-size-tolower

Conversation

@dor-forer
Copy link
Collaborator

@dor-forer dor-forer commented May 26, 2025

Describe the changes in the pull request

A clear and concise description of what the PR is solving, including:

  1. Current:
    In unicode_tolower(), for characters where the lowercase version occupies more bytes than its uppercase version, the for loop which traverse the utf-8 string is exceeding the limits of the allocated memory when the length of the lowercase version of the term is longer than its original form.
    In tag_strtolower() and rm_normalize(), the length passed to unicode_tolower() is the length of the original term, but it should be the length of the term after unescaping it.

  2. Change:

  • In unicode_tolower():
    • Replace the for loop for a while loop which check that the current utf-8 symbols is inside the valid limits.
    • Break the loop if a NULL character is found
  • In tag_strtolower() and rm_normalize():
    • Pass to unicode_tolower() the updated length after the term is unescaped.
  1. Outcome:

Which additional issues this PR fixes

  1. MOD-9835

Main objects this PR modified

  1. ...

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

@nafraf nafraf changed the title Bigger buffer and tests MOD-9835: Fix unicode_tolower() May 27, 2025
@nafraf nafraf requested review from Copilot and meiravgri May 28, 2025 20:08
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes the issue where unicode_tolower() would write beyond the allocated memory when the lowercase version of a multibyte term is longer than its original form. It also adjusts tag_strtolower() and rm_normalize() to pass the updated term length (after unescaping) to unicode_tolower().

  • Replaced the for loop in unicode_tolower() with a while loop to ensure boundary checks.
  • Updated tag_strtolower() and rm_normalize() to use the correct length after unescaping.
  • Added new tests in tests/pytests/test_multibyte_char_terms.py to validate the multibyte lowercase conversion behavior.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
tests/pytests/test_multibyte_char_terms.py Added tests to validate proper handling of multibyte lowercase conversion.
src/util/strconv.h Modified unicode_tolower() to use a while loop and updated rm_normalize().
src/query.c Fixed tag_strtolower() to use the current length after unescaping.

@nafraf nafraf requested a review from Copilot May 28, 2025 20:45
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR ensures unicode_tolower() does not overrun buffers when lowercase expansions grow and correctly passes the post-unescape string length to it in both tag_strtolower and rm_normalize.

  • Switches the UTF-8 iteration in unicode_tolower from a simple for to a boundary-checked while and handles early NULL codepoints
  • Updates tag_strtolower and rm_normalize to use the actual length of the unescaped string when calling unicode_tolower
  • Adds end-to-end pytest coverage for multibyte terms whose lowercase form is larger than the original

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
tests/pytests/test_multibyte_char_terms.py New tests for long multibyte TAG and TEXT terms
src/util/strconv.h Revamped unicode_tolower, assertion, malloc logic, updated normalize
src/query.c Pass updated *len to unicode_tolower in tag_strtolower

Co-authored-by: Copilot <[email protected]>
Copy link
Collaborator

@meiravgri meiravgri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great!! some small comemnts and don't listen to copilot

size_t newLen = unicode_tolower(origStr, *len);
size_t newLen = unicode_tolower(origStr, length);
if (newLen) {
origStr[newLen] = '\0';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small thought — I think this part might be a bit sensitive down the line when reallocation is added. If the buffer is resized to exactly newLen, then writing origStr[newLen] = '\0'; would write one byte past the end. Would be easy to miss, and could cause a subtle overflow 😅.

Maybe worth adding a test to catch this kind of edge case early? The Turkish 'İ' (U+0130) is a good one — it’s 2 bytes uppercase and becomes 3 bytes lowercase. 😊

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. You are right. I added more tests with lengths around the limit which will require memory reallocation.
Please check if it is enough.

@nafraf nafraf requested a review from meiravgri May 29, 2025 20:03
'FT.SEARCH', 'idx', f'@t:{{"{t_lower}"}}', 'NOCONTENT', 'DIALECT', dialect)
env.assertEqual(res, [1, '{doc}:2'])

# Test the edge cases where the length is around the limit to require
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By “reallocation” I meant cases where the lowercase output is longer than the input — a single-character string like İ would cover that.
Is there a reason to use the more complex loop if a simpler test would make the case clearer? (Especially since the previous test already covers sequences that exceed the stack allocation size.)

@nafraf nafraf added this pull request to the merge queue Jun 5, 2025
Merged via the queue into master with commit eb11bfd Jun 5, 2025
24 checks passed
@nafraf nafraf deleted the dorav-fix-buffer-size-tolower branch June 5, 2025 13:36
redisearch-backport-pull-request bot pushed a commit that referenced this pull request Jun 5, 2025
* Bigger buffer and tests

* Fixes

* Fix tag_strtolower to fix test with escaped TAG

* Fix typo

* Fix rm_normalize() and test dialects

* Update src/util/strconv.h

Co-authored-by: meiravgri <[email protected]>

* Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation

* Fix assertion

* Fix typo

Co-authored-by: Copilot <[email protected]>

* Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior

* Add more tests

* Add test comments

* Simplify tests

* Fix typo

* Test with a single multi-byte char

* Add cpp test for unicode_tolower()

* Fix testSpecialUnicodeCase

* update u_buffer allocation for better maintainability

* Add tests lower UTF8 byes exceeding max size

* Fix typo

---------

Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit eb11bfd)
@redisearch-backport-pull-request
Copy link
Contributor

Successfully created backport PR for 2.8:

redisearch-backport-pull-request bot pushed a commit that referenced this pull request Jun 5, 2025
* Bigger buffer and tests

* Fixes

* Fix tag_strtolower to fix test with escaped TAG

* Fix typo

* Fix rm_normalize() and test dialects

* Update src/util/strconv.h

Co-authored-by: meiravgri <[email protected]>

* Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation

* Fix assertion

* Fix typo

Co-authored-by: Copilot <[email protected]>

* Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior

* Add more tests

* Add test comments

* Simplify tests

* Fix typo

* Test with a single multi-byte char

* Add cpp test for unicode_tolower()

* Fix testSpecialUnicodeCase

* update u_buffer allocation for better maintainability

* Add tests lower UTF8 byes exceeding max size

* Fix typo

---------

Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit eb11bfd)
@redisearch-backport-pull-request
Copy link
Contributor

Successfully created backport PR for 2.6:

redisearch-backport-pull-request bot pushed a commit that referenced this pull request Jun 5, 2025
* Bigger buffer and tests

* Fixes

* Fix tag_strtolower to fix test with escaped TAG

* Fix typo

* Fix rm_normalize() and test dialects

* Update src/util/strconv.h

Co-authored-by: meiravgri <[email protected]>

* Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation

* Fix assertion

* Fix typo

Co-authored-by: Copilot <[email protected]>

* Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior

* Add more tests

* Add test comments

* Simplify tests

* Fix typo

* Test with a single multi-byte char

* Add cpp test for unicode_tolower()

* Fix testSpecialUnicodeCase

* update u_buffer allocation for better maintainability

* Add tests lower UTF8 byes exceeding max size

* Fix typo

---------

Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit eb11bfd)
@redisearch-backport-pull-request
Copy link
Contributor

Successfully created backport PR for 2.10:

redisearch-backport-pull-request bot pushed a commit that referenced this pull request Jun 5, 2025
* Bigger buffer and tests

* Fixes

* Fix tag_strtolower to fix test with escaped TAG

* Fix typo

* Fix rm_normalize() and test dialects

* Update src/util/strconv.h

Co-authored-by: meiravgri <[email protected]>

* Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation

* Fix assertion

* Fix typo

Co-authored-by: Copilot <[email protected]>

* Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior

* Add more tests

* Add test comments

* Simplify tests

* Fix typo

* Test with a single multi-byte char

* Add cpp test for unicode_tolower()

* Fix testSpecialUnicodeCase

* update u_buffer allocation for better maintainability

* Add tests lower UTF8 byes exceeding max size

* Fix typo

---------

Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit eb11bfd)
@redisearch-backport-pull-request
Copy link
Contributor

Successfully created backport PR for 8.0:

redisearch-backport-pull-request bot pushed a commit that referenced this pull request Jun 5, 2025
* Bigger buffer and tests

* Fixes

* Fix tag_strtolower to fix test with escaped TAG

* Fix typo

* Fix rm_normalize() and test dialects

* Update src/util/strconv.h

Co-authored-by: meiravgri <[email protected]>

* Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation

* Fix assertion

* Fix typo

Co-authored-by: Copilot <[email protected]>

* Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior

* Add more tests

* Add test comments

* Simplify tests

* Fix typo

* Test with a single multi-byte char

* Add cpp test for unicode_tolower()

* Fix testSpecialUnicodeCase

* update u_buffer allocation for better maintainability

* Add tests lower UTF8 byes exceeding max size

* Fix typo

---------

Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit eb11bfd)
@redisearch-backport-pull-request
Copy link
Contributor

Successfully created backport PR for 8.2:

github-merge-queue bot pushed a commit that referenced this pull request Jun 5, 2025
MOD-9835: Fix unicode_tolower() (#6211)

* Bigger buffer and tests

* Fixes

* Fix tag_strtolower to fix test with escaped TAG

* Fix typo

* Fix rm_normalize() and test dialects

* Update src/util/strconv.h



* Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation

* Fix assertion

* Fix typo



* Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior

* Add more tests

* Add test comments

* Simplify tests

* Fix typo

* Test with a single multi-byte char

* Add cpp test for unicode_tolower()

* Fix testSpecialUnicodeCase

* update u_buffer allocation for better maintainability

* Add tests lower UTF8 byes exceeding max size

* Fix typo

---------





(cherry picked from commit eb11bfd)

Co-authored-by: dor-forer <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Jun 5, 2025
MOD-9835: Fix unicode_tolower() (#6211)

* Bigger buffer and tests

* Fixes

* Fix tag_strtolower to fix test with escaped TAG

* Fix typo

* Fix rm_normalize() and test dialects

* Update src/util/strconv.h



* Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation

* Fix assertion

* Fix typo



* Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior

* Add more tests

* Add test comments

* Simplify tests

* Fix typo

* Test with a single multi-byte char

* Add cpp test for unicode_tolower()

* Fix testSpecialUnicodeCase

* update u_buffer allocation for better maintainability

* Add tests lower UTF8 byes exceeding max size

* Fix typo

---------





(cherry picked from commit eb11bfd)

Co-authored-by: dor-forer <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Jun 8, 2025
* MOD-9835: Fix unicode_tolower() (#6211)

* Bigger buffer and tests

* Fixes

* Fix tag_strtolower to fix test with escaped TAG

* Fix typo

* Fix rm_normalize() and test dialects

* Update src/util/strconv.h

Co-authored-by: meiravgri <[email protected]>

* Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation

* Fix assertion

* Fix typo

Co-authored-by: Copilot <[email protected]>

* Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior

* Add more tests

* Add test comments

* Simplify tests

* Fix typo

* Test with a single multi-byte char

* Add cpp test for unicode_tolower()

* Fix testSpecialUnicodeCase

* update u_buffer allocation for better maintainability

* Add tests lower UTF8 byes exceeding max size

* Fix typo

---------

Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit eb11bfd)

* Add missing include for rmutil/alloc.h in unicode_tolower tests

---------

Co-authored-by: dor-forer <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
@oshadmi oshadmi mentioned this pull request Jun 13, 2025
2 tasks
github-merge-queue bot pushed a commit that referenced this pull request Jun 19, 2025
* MOD-9835: Fix unicode_tolower() (#6211)

* Bigger buffer and tests

* Fixes

* Fix tag_strtolower to fix test with escaped TAG

* Fix typo

* Fix rm_normalize() and test dialects

* Update src/util/strconv.h

Co-authored-by: meiravgri <[email protected]>

* Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation

* Fix assertion

* Fix typo

Co-authored-by: Copilot <[email protected]>

* Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior

* Add more tests

* Add test comments

* Simplify tests

* Fix typo

* Test with a single multi-byte char

* Add cpp test for unicode_tolower()

* Fix testSpecialUnicodeCase

* update u_buffer allocation for better maintainability

* Add tests lower UTF8 byes exceeding max size

* Fix typo

---------

Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit eb11bfd)

* Add missing include for rmutil/alloc.h in unicode_tolower tests

* Update test_utf8_lowercase_longer_than_uppercase_tags: remote tag autoescape test

* Move include rmutil/alloc.c to strconv.h

---------

Co-authored-by: dor-forer <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Jun 19, 2025
* MOD-9835: Fix unicode_tolower() (#6211)

* Bigger buffer and tests

* Fixes

* Fix tag_strtolower to fix test with escaped TAG

* Fix typo

* Fix rm_normalize() and test dialects

* Update src/util/strconv.h

Co-authored-by: meiravgri <[email protected]>

* Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation

* Fix assertion

* Fix typo

Co-authored-by: Copilot <[email protected]>

* Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior

* Add more tests

* Add test comments

* Simplify tests

* Fix typo

* Test with a single multi-byte char

* Add cpp test for unicode_tolower()

* Fix testSpecialUnicodeCase

* update u_buffer allocation for better maintainability

* Add tests lower UTF8 byes exceeding max size

* Fix typo

---------

Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit eb11bfd)

* Add missing include for rmutil/alloc.h in strconv.h

* Fix test, remove dialect 4

* Update test_utf8_lowercase_longer_than_uppercase_tags: remote tag autoescape test

(cherry picked from commit 66e4942)

---------

Co-authored-by: dor-forer <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: nafraf <[email protected]>
Co-authored-by: meiravgri <[email protected]>
Co-authored-by: Copilot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants