Skip to content

[8.2] [MOD-11233] - fix: empty string token should be counted in byteOffset…#6973

Merged
JoanFM merged 2 commits into8.2from
backport-6938-to-8.2
Oct 7, 2025
Merged

[8.2] [MOD-11233] - fix: empty string token should be counted in byteOffset…#6973
JoanFM merged 2 commits into8.2from
backport-6938-to-8.2

Conversation

@JoanFM
Copy link
Collaborator

@JoanFM JoanFM commented Oct 3, 2025

Note

Ensures empty-string tokens are recorded in global byte offsets for accurate highlighting, updates forward-index token function signature, and adds highlight tests.

  • Highlighting/Byte Offsets:
    • Always record token byte offsets via writeByteOffsets in src/document.c, including empty tokens, to maintain correct cross-field positions for highlighting.
    • Defer indexesEmpty check until after writing offsets; empty tokens are still skipped for indexing when appropriate.
  • Forward Index API:
    • Change forwardIndexTokenFunc signature to int forwardIndexTokenFunc(ForwardIndexTokenizerCtx *tokCtx, const Token *tokInfo) and document it; remove offset writing from this function.
    • Update declaration in src/forward_index.h accordingly.
  • Tests:
    • Add tests/pytests/test_highlight.py covering highlighting across empty strings, missing fields, array cases, field order, and INDEXEMPTY behavior.

Written by Cursor Bugbot for commit 7cdb3ea. This will update automatically on new commits. Configure here.

…s for highlighting (#6938)

* initial commit

* add newline at end of file

* add debug prints

* test: fix testing index being awaited

* add test from the MOD-11233

* test: simplify tests to be easier to handle/debug

* test: update test

* test: improve test assertion

* add test with some debug prints

* fix: fix tokenization empty

* remove useless field from tokenizer

* fix: simpler fix

* fix: improved behavior for forward index

* fix spell check

* change test name and var name

* change test text

* refactor: byteoffset handling independent

* test: add more tests with highlighting and empty combination

* apply changes as per code review

---------

Co-authored-by: jonathan keinan <[email protected]>
(cherry picked from commit 521847a)
@JoanFM JoanFM requested a review from kei-nan October 3, 2025 14:44
@github-actions github-actions bot added the size:M label Oct 3, 2025
cursor[bot]

This comment was marked as outdated.

while (0 != aCtx->tokenizer->Next(aCtx->tokenizer, &tok)) {
// We always want to write the byte offset, even when string is empty since it is global across all fields and
// we need to know the start position of the next field. This is required for highlighting.
writeByteOffsets(&tokCtx, &tok);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Memory Leak from Skipped Empty Tokens

Skipping empty tokens can lead to a memory leak. When an empty token is encountered and the continue statement is hit, the cleanup for tok.allocatedTok is bypassed, leaving allocated memory unfreed.

Fix in Cursor Fix in Web

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the tokenizer, if tokLen=0 it means that in the normalizer dstLen cannot be larger than the original, and then no more memory is allocated for this to be handled here.

@codecov
Copy link

codecov bot commented Oct 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.44%. Comparing base (e9db578) to head (7cdb3ea).
⚠️ Report is 2 commits behind head on 8.2.

Additional details and impacted files
@@            Coverage Diff             @@
##              8.2    #6973      +/-   ##
==========================================
- Coverage   89.47%   89.44%   -0.03%     
==========================================
  Files         253      253              
  Lines       41663    41665       +2     
  Branches     3725     3725              
==========================================
- Hits        37278    37269       -9     
- Misses       4336     4347      +11     
  Partials       49       49              
Flag Coverage Δ
flow 82.04% <100.00%> (-0.15%) ⬇️
unit 47.58% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JoanFM JoanFM added this pull request to the merge queue Oct 6, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 6, 2025
@JoanFM JoanFM added this pull request to the merge queue Oct 6, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 6, 2025
@JoanFM JoanFM added this pull request to the merge queue Oct 6, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 6, 2025
@JoanFM JoanFM added this pull request to the merge queue Oct 6, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 6, 2025
@JoanFM JoanFM added this pull request to the merge queue Oct 7, 2025
Merged via the queue into 8.2 with commit 0dd738f Oct 7, 2025
19 checks passed
@JoanFM JoanFM deleted the backport-6938-to-8.2 branch October 7, 2025 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants