Fix heap corruption for strings larger than 2114 bytes #211

Merged
yanyiwu merged 2 commits into master from copilot/fix-corrupted-size-error
Mar 7, 2026

Copilot AI commented Mar 5, 2026

Processing strings longer than ~2114 bytes makes glibc's malloc abort with the message "corrupted size vs. prev_size". The culprit is RuneStrArray being backed by LocalVector<struct RuneStr>, a container that explicitly documents it is only safe for primitive types, not structs.
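As a rough illustration of the replacement container, the sketch below stores a small struct in std::vector and grows it well past 2114 elements. RuneStrSketch, its fields, RuneStrArraySketch, and CollectAsciiRunes are hypothetical stand-ins chosen for this example; they are not cppjieba's actual definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for cppjieba's RuneStr; the real field list may differ.
struct RuneStrSketch {
  uint32_t rune;    // decoded code point
  uint32_t offset;  // byte offset of the rune in the input
  uint32_t len;     // byte length of the encoded rune
};

// Mirrors the "After" typedef: a standard vector of structs.
typedef std::vector<RuneStrSketch> RuneStrArraySketch;

// Appends one entry per input byte. ASCII only, purely for illustration.
inline RuneStrArraySketch CollectAsciiRunes(const char* s, size_t len) {
  // The offset field is 32 bits wide, so the caller must keep len in range.
  assert(len <= UINT32_MAX);
  RuneStrArraySketch runes;
  runes.reserve(len);
  for (size_t i = 0; i < len; ++i) {
    RuneStrSketch r;
    r.rune = static_cast<unsigned char>(s[i]);
    r.offset = static_cast<uint32_t>(i);
    r.len = 1;
    runes.push_back(r);
  }
  return runes;
}
```

Calling CollectAsciiRunes on a 3000-byte buffer yields 3000 entries, with std::vector handling every reallocation of the struct elements itself.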

Changes

  • include/cppjieba/Unicode.hpp
    • Change RuneStrArray from limonp::LocalVector<struct RuneStr> to std::vector<RuneStr>
    • Change DecodeUTF8RunesInString loop counter from uint32_t i to size_t i to eliminate infinite-loop risk on strings >4 GB; add explicit overflow guard before narrowing casts to uint32_t fields in RuneStr
      // Before
      typedef limonp::LocalVector<struct RuneStr> RuneStrArray;
      for (uint32_t i = 0, j = 0; i < len;) { ... }

      // After
      typedef std::vector<RuneStr> RuneStrArray;
      for (size_t i = 0, j = 0; i < len;) {
          if (i > UINT32_MAX || j > UINT32_MAX) { runes.clear(); return false; }
          ...
      }
  • test/unittest/unicode_test.cpp — Update UnicodeTest.Test1 expected string to match std::vector's operator<< format (no extra element-wrapping quotes).
  • test/unittest/segments_test.cpp — Add MixSegmentTest.LongInput regression test covering Chinese and ASCII inputs exceeding 2114 bytes.
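The overflow guard described in the list above can be sketched in isolation. This is a simplified illustration, not the code from Unicode.hpp: RuneSpan, FitsInUint32, Utf8SeqLen, and CollectSpans are hypothetical names, and the decoder only inspects lead bytes without validating continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record holding the two 32-bit positions that must not overflow.
struct RuneSpan {
  uint32_t offset;          // byte offset in the input
  uint32_t unicode_offset;  // index of the rune in the decoded sequence
};

// True when both counters can be narrowed to uint32_t without truncation.
inline bool FitsInUint32(size_t i, size_t j) {
  return i <= UINT32_MAX && j <= UINT32_MAX;
}

// Sequence length implied by a UTF-8 lead byte, or 0 if the byte is invalid.
inline size_t Utf8SeqLen(unsigned char lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 0;
}

// Walks the input with size_t counters and checks them before each narrowing cast.
inline bool CollectSpans(const char* s, size_t len, std::vector<RuneSpan>& spans) {
  assert(s != NULL || len == 0);
  spans.clear();
  for (size_t i = 0, j = 0; i < len;) {
    if (!FitsInUint32(i, j)) { spans.clear(); return false; }
    size_t n = Utf8SeqLen(static_cast<unsigned char>(s[i]));
    if (n == 0 || n > len - i) { spans.clear(); return false; }  // invalid or truncated
    RuneSpan span;
    span.offset = static_cast<uint32_t>(i);
    span.unicode_offset = static_cast<uint32_t>(j);
    spans.push_back(span);
    i += n;
    ++j;
  }
  return true;
}
```

Keeping the counters as size_t means the loop condition i < len stays meaningful for any input length, while the guard rejects inputs whose positions would not fit in the 32-bit fields instead of silently wrapping.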
Original prompt

This section details the original issue you should resolve

<issue_title>corrupted size vs. prev_size</issue_title>
<issue_description>For jieba.cut(line), When the line's size > 2114, I will meet:

corrupted size vs. prev_size

then exit as 134.

Does it mean the line is too long to cut word?</issue_description>

Comments on the Issue (you are @copilot in this section)


- Replace RuneStrArray (LocalVector<RuneStr>) with std::vector<RuneStr>
  since LocalVector explicitly warns it's only safe for primitive types,
  not structs like RuneStr
- Change loop counter in DecodeUTF8RunesInString from uint32_t to
  size_t to prevent integer overflow for very long strings, with an
  explicit overflow check before truncating to uint32_t for RuneStr fields
- Update UnicodeTest.Test1 expected string to match std::vector
  operator<< format
- Add MixSegmentTest.LongInput regression test for strings > 2114 bytes

Co-authored-by: yanyiwu <[email protected]>
Copilot AI changed the title from "[WIP] Fix corrupted size vs. prev_size error in jieba.cut" to "Fix heap corruption for strings larger than 2114 bytes" on Mar 5, 2026
yanyiwu marked this pull request as ready for review on March 7, 2026 10:48
yanyiwu merged commit a998b31 into master on Mar 7, 2026
56 of 64 checks passed
yanyiwu deleted the copilot/fix-corrupted-size-error branch on March 7, 2026 10:48
Development

Successfully merging this pull request may close these issues.

corrupted size vs. prev_size