Conversation
Co-authored-by: meiravgri <[email protected]>
…or and buffer allocation
There was a problem hiding this comment.
Pull Request Overview
This PR fixes the issue where unicode_tolower() would write beyond the allocated memory when the lowercase version of a multibyte term is longer than its original form. It also adjusts tag_strtolower() and rm_normalize() to pass the updated term length (after unescaping) to unicode_tolower().
- Replaced the for loop in unicode_tolower() with a while loop to ensure boundary checks.
- Updated tag_strtolower() and rm_normalize() to use the correct length after unescaping.
- Added new tests in tests/pytests/test_multibyte_char_terms.py to validate the multibyte lowercase conversion behavior.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/pytests/test_multibyte_char_terms.py | Added tests to validate proper handling of multibyte lowercase conversion. |
| src/util/strconv.h | Modified unicode_tolower() to use a while loop and updated rm_normalize(). |
| src/query.c | Fixed tag_strtolower() to use the current length after unescaping. |
There was a problem hiding this comment.
Pull Request Overview
This PR ensures unicode_tolower() does not overrun buffers when lowercase expansions grow and correctly passes the post-unescape string length to it in both tag_strtolower and rm_normalize.
- Switches the UTF-8 iteration in
unicode_tolowerfrom a simpleforto a boundary-checkedwhileand handles early NULL codepoints - Updates
tag_strtolowerandrm_normalizeto use the actual length of the unescaped string when callingunicode_tolower - Adds end-to-end pytest coverage for multibyte terms whose lowercase form is larger than the original
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/pytests/test_multibyte_char_terms.py | New tests for long multibyte TAG and TEXT terms |
| src/util/strconv.h | Revamped unicode_tolower, assertion, malloc logic, updated normalize |
| src/query.c | Pass updated *len to unicode_tolower in tag_strtolower |
Co-authored-by: Copilot <[email protected]>
meiravgri
left a comment
There was a problem hiding this comment.
great!! some small comemnts and don't listen to copilot
…mes for clarity on UTF-8 behavior
| size_t newLen = unicode_tolower(origStr, *len); | ||
| size_t newLen = unicode_tolower(origStr, length); | ||
| if (newLen) { | ||
| origStr[newLen] = '\0'; |
There was a problem hiding this comment.
Just a small thought — I think this part might be a bit sensitive down the line when reallocation is added. If the buffer is resized to exactly newLen, then writing origStr[newLen] = '\0'; would write one byte past the end. Would be easy to miss, and could cause a subtle overflow 😅.
Maybe worth adding a test to catch this kind of edge case early? The Turkish 'İ' (U+0130) is a good one — it’s 2 bytes uppercase and becomes 3 bytes lowercase. 😊
There was a problem hiding this comment.
OK. You are right. I added more tests with lengths around the limit which will require memory reallocation.
Please check if it is enough.
| 'FT.SEARCH', 'idx', f'@t:{{"{t_lower}"}}', 'NOCONTENT', 'DIALECT', dialect) | ||
| env.assertEqual(res, [1, '{doc}:2']) | ||
|
|
||
| # Test the edge cases where the length is around the limit to require |
There was a problem hiding this comment.
By “reallocation” I meant cases where the lowercase output is longer than the input — a single-character string like İ would cover that.
Is there a reason to use the more complex loop if a simpler test would make the case clearer? (Especially since the previous test already covers sequences that exceed the stack allocation size.)
* Bigger buffer and tests * Fixes * Fix tag_strtolower to fix test with escaped TAG * Fix typo * Fix rm_normalize() and test dialects * Update src/util/strconv.h Co-authored-by: meiravgri <[email protected]> * Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation * Fix assertion * Fix typo Co-authored-by: Copilot <[email protected]> * Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior * Add more tests * Add test comments * Simplify tests * Fix typo * Test with a single multi-byte char * Add cpp test for unicode_tolower() * Fix testSpecialUnicodeCase * update u_buffer allocation for better maintainability * Add tests lower UTF8 byes exceeding max size * Fix typo --------- Co-authored-by: nafraf <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: meiravgri <[email protected]> Co-authored-by: Copilot <[email protected]> (cherry picked from commit eb11bfd)
|
Successfully created backport PR for |
* Bigger buffer and tests * Fixes * Fix tag_strtolower to fix test with escaped TAG * Fix typo * Fix rm_normalize() and test dialects * Update src/util/strconv.h Co-authored-by: meiravgri <[email protected]> * Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation * Fix assertion * Fix typo Co-authored-by: Copilot <[email protected]> * Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior * Add more tests * Add test comments * Simplify tests * Fix typo * Test with a single multi-byte char * Add cpp test for unicode_tolower() * Fix testSpecialUnicodeCase * update u_buffer allocation for better maintainability * Add tests lower UTF8 byes exceeding max size * Fix typo --------- Co-authored-by: nafraf <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: meiravgri <[email protected]> Co-authored-by: Copilot <[email protected]> (cherry picked from commit eb11bfd)
|
Successfully created backport PR for |
* Bigger buffer and tests * Fixes * Fix tag_strtolower to fix test with escaped TAG * Fix typo * Fix rm_normalize() and test dialects * Update src/util/strconv.h Co-authored-by: meiravgri <[email protected]> * Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation * Fix assertion * Fix typo Co-authored-by: Copilot <[email protected]> * Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior * Add more tests * Add test comments * Simplify tests * Fix typo * Test with a single multi-byte char * Add cpp test for unicode_tolower() * Fix testSpecialUnicodeCase * update u_buffer allocation for better maintainability * Add tests lower UTF8 byes exceeding max size * Fix typo --------- Co-authored-by: nafraf <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: meiravgri <[email protected]> Co-authored-by: Copilot <[email protected]> (cherry picked from commit eb11bfd)
|
Successfully created backport PR for |
* Bigger buffer and tests * Fixes * Fix tag_strtolower to fix test with escaped TAG * Fix typo * Fix rm_normalize() and test dialects * Update src/util/strconv.h Co-authored-by: meiravgri <[email protected]> * Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation * Fix assertion * Fix typo Co-authored-by: Copilot <[email protected]> * Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior * Add more tests * Add test comments * Simplify tests * Fix typo * Test with a single multi-byte char * Add cpp test for unicode_tolower() * Fix testSpecialUnicodeCase * update u_buffer allocation for better maintainability * Add tests lower UTF8 byes exceeding max size * Fix typo --------- Co-authored-by: nafraf <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: meiravgri <[email protected]> Co-authored-by: Copilot <[email protected]> (cherry picked from commit eb11bfd)
|
Successfully created backport PR for |
* Bigger buffer and tests * Fixes * Fix tag_strtolower to fix test with escaped TAG * Fix typo * Fix rm_normalize() and test dialects * Update src/util/strconv.h Co-authored-by: meiravgri <[email protected]> * Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation * Fix assertion * Fix typo Co-authored-by: Copilot <[email protected]> * Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior * Add more tests * Add test comments * Simplify tests * Fix typo * Test with a single multi-byte char * Add cpp test for unicode_tolower() * Fix testSpecialUnicodeCase * update u_buffer allocation for better maintainability * Add tests lower UTF8 byes exceeding max size * Fix typo --------- Co-authored-by: nafraf <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: meiravgri <[email protected]> Co-authored-by: Copilot <[email protected]> (cherry picked from commit eb11bfd)
|
Successfully created backport PR for |
MOD-9835: Fix unicode_tolower() (#6211) * Bigger buffer and tests * Fixes * Fix tag_strtolower to fix test with escaped TAG * Fix typo * Fix rm_normalize() and test dialects * Update src/util/strconv.h * Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation * Fix assertion * Fix typo * Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior * Add more tests * Add test comments * Simplify tests * Fix typo * Test with a single multi-byte char * Add cpp test for unicode_tolower() * Fix testSpecialUnicodeCase * update u_buffer allocation for better maintainability * Add tests lower UTF8 byes exceeding max size * Fix typo --------- (cherry picked from commit eb11bfd) Co-authored-by: dor-forer <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: meiravgri <[email protected]> Co-authored-by: Copilot <[email protected]>
MOD-9835: Fix unicode_tolower() (#6211) * Bigger buffer and tests * Fixes * Fix tag_strtolower to fix test with escaped TAG * Fix typo * Fix rm_normalize() and test dialects * Update src/util/strconv.h * Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation * Fix assertion * Fix typo * Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior * Add more tests * Add test comments * Simplify tests * Fix typo * Test with a single multi-byte char * Add cpp test for unicode_tolower() * Fix testSpecialUnicodeCase * update u_buffer allocation for better maintainability * Add tests lower UTF8 byes exceeding max size * Fix typo --------- (cherry picked from commit eb11bfd) Co-authored-by: dor-forer <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: meiravgri <[email protected]> Co-authored-by: Copilot <[email protected]>
* MOD-9835: Fix unicode_tolower() (#6211) * Bigger buffer and tests * Fixes * Fix tag_strtolower to fix test with escaped TAG * Fix typo * Fix rm_normalize() and test dialects * Update src/util/strconv.h Co-authored-by: meiravgri <[email protected]> * Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation * Fix assertion * Fix typo Co-authored-by: Copilot <[email protected]> * Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior * Add more tests * Add test comments * Simplify tests * Fix typo * Test with a single multi-byte char * Add cpp test for unicode_tolower() * Fix testSpecialUnicodeCase * update u_buffer allocation for better maintainability * Add tests lower UTF8 byes exceeding max size * Fix typo --------- Co-authored-by: nafraf <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: meiravgri <[email protected]> Co-authored-by: Copilot <[email protected]> (cherry picked from commit eb11bfd) * Add missing include for rmutil/alloc.h in unicode_tolower tests --------- Co-authored-by: dor-forer <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: meiravgri <[email protected]> Co-authored-by: Copilot <[email protected]>
* MOD-9835: Fix unicode_tolower() (#6211) * Bigger buffer and tests * Fixes * Fix tag_strtolower to fix test with escaped TAG * Fix typo * Fix rm_normalize() and test dialects * Update src/util/strconv.h Co-authored-by: meiravgri <[email protected]> * Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation * Fix assertion * Fix typo Co-authored-by: Copilot <[email protected]> * Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior * Add more tests * Add test comments * Simplify tests * Fix typo * Test with a single multi-byte char * Add cpp test for unicode_tolower() * Fix testSpecialUnicodeCase * update u_buffer allocation for better maintainability * Add tests lower UTF8 byes exceeding max size * Fix typo --------- Co-authored-by: nafraf <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: meiravgri <[email protected]> Co-authored-by: Copilot <[email protected]> (cherry picked from commit eb11bfd) * Add missing include for rmutil/alloc.h in unicode_tolower tests * Update test_utf8_lowercase_longer_than_uppercase_tags: remote tag autoescape test * Move include rmutil/alloc.c to strconv.h --------- Co-authored-by: dor-forer <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: meiravgri <[email protected]> Co-authored-by: Copilot <[email protected]>
* MOD-9835: Fix unicode_tolower() (#6211) * Bigger buffer and tests * Fixes * Fix tag_strtolower to fix test with escaped TAG * Fix typo * Fix rm_normalize() and test dialects * Update src/util/strconv.h Co-authored-by: meiravgri <[email protected]> * Enhance unicode_tolower comments for clarity on transformation behavior and buffer allocation * Fix assertion * Fix typo Co-authored-by: Copilot <[email protected]> * Refactor tag_strtolower to improve length handling and update test names for clarity on UTF-8 behavior * Add more tests * Add test comments * Simplify tests * Fix typo * Test with a single multi-byte char * Add cpp test for unicode_tolower() * Fix testSpecialUnicodeCase * update u_buffer allocation for better maintainability * Add tests lower UTF8 byes exceeding max size * Fix typo --------- Co-authored-by: nafraf <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: meiravgri <[email protected]> Co-authored-by: Copilot <[email protected]> (cherry picked from commit eb11bfd) * Add missing include for rmutil/alloc.h in strconv.h * Fix test, remove dialect 4 * Update test_utf8_lowercase_longer_than_uppercase_tags: remote tag autoescape test (cherry picked from commit 66e4942) --------- Co-authored-by: dor-forer <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: nafraf <[email protected]> Co-authored-by: meiravgri <[email protected]> Co-authored-by: Copilot <[email protected]>
Describe the changes in the pull request
A clear and concise description of what the PR is solving, including:
Current:
In
unicode_tolower(), for characters where the lowercase version occupies more bytes than its uppercase version, theforloop which traverse the utf-8 string is exceeding the limits of the allocated memory when the length of the lowercase version of the term is longer than its original form.In
tag_strtolower()andrm_normalize(), the length passed tounicode_tolower()is the length of the original term, but it should be the length of the term after unescaping it.Change:
unicode_tolower():forloop for awhileloop which check that the current utf-8 symbols is inside the valid limits.tag_strtolower()andrm_normalize():unicode_tolower()the updated length after the term is unescaped.Which additional issues this PR fixes
Main objects this PR modified
Mark if applicable